Abstract

This report summarizes three areas of research investigated under the Air Force grant: (i) balancing exploration
and exploitation when performing reinforcement learning in POMDPs; (ii) proper sharing of information when
performing POMDP-based reinforcement learning on a network; and (iii) topic modeling for time-evolving systems,
with the latter now being transitioned to cybersecurity.

For  research  thrust  (i),  a  fundamental  objective  in  reinforcement  learning  is  the  maintenance  of  a  proper  balance 
between  exploration  and  exploitation.  This  problem  becomes  more  challenging  when  the  agent  can  only  partially 
observe the states of its environment. In this project we propose a dual-policy method for jointly learning the
agent behavior and the balance between exploration and exploitation in partially observable environments. The method
subsumes  traditional  exploration,  in  which  the  agent  takes  actions  to  gather  information  about  the  environment,  and 
active  learning,  in  which  the  agent  queries  an  oracle  for  optimal  actions  (with  an  associated  cost  for  employing 
the  oracle).  The  form  of  the  employed  exploration  is  dictated  by  the  specific  problem.  Theoretical  guarantees  are 
provided  concerning  the  optimality  of  the  balancing  of  exploration  and  exploitation.  The  effectiveness  of  the  method 
is  demonstrated  by  experimental  results  on  benchmark  problems. 

For research thrust (ii), the Dirichlet process (DP) has proven a powerful nonparametric prior in multi-task
reinforcement learning (MTRL). A drawback of the DP prior is that it either encourages global clustering based
on all parameters, or it encourages independent local clustering based on subsets of parameters. In this report we
generalize the MTRL framework by employing the nonparametric dependent local partition process (LPP) as a
prior to promote simultaneous local and global clustering. We provide theoretical analysis of the correlated local
clustering structure induced by the LPP and show that the structure facilitates information-sharing between partially
similar  RL  tasks.  We  develop  the  LPP-based  MTRL  framework  assuming  the  environment  in  each  RL  task  is  an 
unknown  partially  observable  Markov  decision  process,  and  we  provide  experimental  results  to  demonstrate  the 
advantage  of  the  LPP-based  MTRL. 

For research thrust (iii), we consider the problem of inferring and modeling topics in a sequence of documents
with  known  publication  dates.  The  documents  at  a  given  time  are  each  characterized  by  a  topic,  and  the  topics  are 
drawn  from  a  mixture  model.  The  proposed  model  infers  the  change  in  the  topic  mixture  weights  as  a  function  of 
time.  The  details  of  this  general  framework  may  take  different  forms,  depending  on  the  specifics  of  the  model.  For 
the  examples  considered  here  we  examine  base  measures  based  on  independent  multinomial-Dirichlet  measures 
for  representation  of  topic-dependent  word  counts.  The  form  of  the  hierarchical  model  allows  efficient  variational 
Bayesian  (VB)  inference,  of  interest  for  large-scale  problems.  We  demonstrate  results  and  make  comparisons  to  the 
model  when  the  dynamic  character  is  removed,  and  also  compare  to  latent  Dirichlet  allocation  (LDA)  and  topics 
over  time  (TOT).  We  consider  a  database  of  NIPS  papers  as  well  as  the  United  States  presidential  State  of  the 
Union  addresses  from  1790  to  2008.  This  is  a  demonstration  of  the  technology,  which  is  now  being  transitioned 
to  time-evolving  data  from  a  computer  network. 


I.  Exploring  and  Exploiting  in  POMDPs 

A  fundamental  challenge  facing  reinforcement  learning  (RL)  algorithms  is  to  maintain  a  proper  balance 
between  exploration  and  exploitation.  The  policy  designed  based  on  previous  experiences  is  by  construction 
constrained,  and  may  not  be  optimal  as  a  result  of  inexperience.  Therefore,  it  is  desirable  to  take  actions 
with  the  goal  of  enhancing  experience.  Although  these  actions  may  not  necessarily  yield  optimal  near-term 
reward  toward  the  ultimate  goal,  they  could,  over  a  long  horizon,  yield  improved  long-term  reward.  The 
fundamental  challenge  is  to  achieve  an  optimal  balance  between  exploration  and  exploitation;  the  former 
is  performed  with  the  goal  of  enhancing  experience  and  preventing  premature  convergence  to  suboptimal 
behavior,  and  the  latter  is  performed  with  the  goal  of  employing  available  experience  to  define  perceived 
optimal  actions. 

For  a  Markov  decision  process  (MDP),  the  problem  of  balancing  exploration  and  exploitation  has 
been addressed successfully by the E3 [1], [2] and R-max [3] algorithms. Many important applications,
however,  have  environments  whose  states  are  not  completely  observed,  leading  to  partially  observable 
MDPs  (POMDPs).  Reinforcement  learning  in  POMDPs  is  challenging,  particularly  in  the  context  of 
balancing exploration and exploitation. Recent work targeting the exploration vs. exploitation
problem is based on an augmented POMDP, with a product state space over the environment states and the
unknown POMDP parameters [4]. This, however, entails solving a complicated planning problem, which
has  a  state  space  that  grows  exponentially  with  the  number  of  unknown  parameters,  making  the  problem 
quickly  intractable  in  practice.  To  mitigate  this  complexity,  active  learning  methods  have  been  proposed 
for  POMDPs,  which  borrow  similar  ideas  from  supervised  learning,  and  apply  them  to  selectively  query  an 
oracle  (domain  expert)  for  the  optimal  action  [5].  Active  learning  has  found  success  in  many  collaborative 
human-machine  tasks  where  expert  advice  is  available. 

In  this  report  we  propose  a  dual-policy  approach  to  balance  exploration  and  exploitation  in  POMDPs, 
by simultaneously learning two policies with partially shared internal structure. The first policy, termed
the primary policy, defines actions based on previous experience; the second policy, termed the auxiliary
policy, is a meta-level policy maintaining a proper balance between exploration and exploitation. We
employ the regionalized policy representation (RPR) [6] to parameterize both policies, and perform
Bayesian learning to update the policy posteriors. The approach applies in either of two cases: (i) the agent
explores by randomly taking the actions that have been insufficiently tried before (traditional exploration),
or (ii) the agent explores by querying an oracle for the optimal action (active learning). In the latter case,
the agent is assessed a query cost from the oracle, in addition to the reward received from the environment.
Either (i) or (ii) is employed as an exploration vehicle, depending upon the application.

The dual-policy approach possesses interesting convergence properties, similar to those of E3 [2] and
R-max [3]. However, our approach assumes the environment is a POMDP while E3 and R-max both
assume an MDP environment. Another distinction is that our approach learns the agent policy directly
from episodes, without estimating the POMDP model. This is in contrast to E3 and R-max (both learn
MDP models) and the active-learning method in [5] (which learns POMDP models).

II.  Regionalized  Policy  Representation 

We  first  provide  a  brief  review  of  the  regionalized  policy  representation,  which  is  used  to  parameterize 
the  primary  policy  and  the  auxiliary  policy  as  discussed  above.  The  material  in  this  section  is  taken  from 
[6],  with  the  proofs  omitted  here. 

Definition 2.1: A regionalized policy representation is a tuple $(A, O, Z, W, \mu, \pi)$. $A$ and $O$ are,
respectively, finite sets of actions and observations. $Z$ is a finite set of belief regions. $W$ is
the belief-region transition function, with $W(z, a, o', z')$ denoting the probability of transiting from $z$ to $z'$
when taking action $a$ in $z$ results in observing $o'$. $\mu$ is the initial distribution of belief regions, with
$\mu(z)$ denoting the probability of initially being in $z$. $\pi$ are the region-dependent stochastic policies,
with $\pi(z, a)$ denoting the probability of taking action $a$ in $z$.

We denote $A = \{1, 2, \ldots, |A|\}$, where $|A|$ is the cardinality of $A$. Similarly, $O = \{1, 2, \ldots, |O|\}$ and
$Z = \{1, 2, \ldots, |Z|\}$. We abbreviate $(a_0, a_1, \ldots, a_T)$ as $a_{0:T}$ and, similarly, $(o_1, o_2, \ldots, o_T)$ as $o_{1:T}$ and
$(z_0, z_1, \ldots, z_T)$ as $z_{0:T}$, where the subscripts index discrete time steps. The history $h_t = \{a_{0:t-1}, o_{1:t}\}$ is
defined as the sequence of actions performed and observations received up to $t$. Let $\Theta = \{\pi, \mu, W\}$ denote
the RPR parameters. Given $h_t$, the RPR yields a joint probability distribution of $z_{0:t}$ and $a_{0:t}$ as follows:

$$p(a_{0:t}, z_{0:t} \mid o_{1:t}, \Theta) = \mu(z_0)\, \pi(z_0, a_0) \prod_{\tau=1}^{t} W(z_{\tau-1}, a_{\tau-1}, o_\tau, z_\tau)\, \pi(z_\tau, a_\tau) \tag{1}$$

By marginalizing $z_{0:t}$ out in (1), we obtain $p(a_{0:t} \mid o_{1:t}, \Theta)$. Furthermore, the history-dependent distribution
of action choices is obtained as follows:

$$p(a_\tau \mid h_\tau, \Theta) = p(a_{0:\tau} \mid o_{1:\tau}, \Theta)\, [p(a_{0:\tau-1} \mid o_{1:\tau-1}, \Theta)]^{-1},$$

which gives a stochastic policy for choosing the action $a_\tau$. The action choice depends solely on the
historical actions and observations, with the unobservable belief regions marginalized out.
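
The marginalization over belief regions can be carried out with a forward recursion over a normalized distribution of the current belief region. The following is a minimal Python sketch of that idea, not the authors' implementation; the function name `rpr_action_probs` and the nested-list layout of `mu`, `pi`, and `W` are assumptions made for illustration.

```python
def rpr_action_probs(mu, pi, W, actions, observations):
    """Return [p(a_tau | h_tau, Theta) for tau = 0..t] along one history.

    mu[z]          : initial belief-region probabilities
    pi[z][a]       : region-dependent action probabilities
    W[z][a][o][z2] : belief-region transition probabilities (sum over z2 is 1)
    actions        : [a_0, ..., a_t]
    observations   : [o_1, ..., o_t] (one element shorter than actions)
    """
    n = len(mu)
    # alpha[z] = p(z_tau = z | a_{0:tau-1}, o_{1:tau}), the region distribution
    # before the action at time tau is taken.
    alpha = list(mu)
    probs = []
    for tau, a in enumerate(actions):
        # p(a_tau | h_tau, Theta) = sum_z alpha[z] * pi[z][a]
        joint = [alpha[z] * pi[z][a] for z in range(n)]
        p_a = sum(joint)
        probs.append(p_a)
        if tau < len(observations):
            o = observations[tau]
            # Condition on a_tau, then move to the next region through W.
            post = [j / p_a for j in joint]
            alpha = [sum(post[z] * W[z][a][o][z2] for z in range(n))
                     for z2 in range(n)]
    return probs
```

Multiplying the returned values gives $p(a_{0:t} \mid o_{1:t}, \Theta)$, so the same routine can also supply the action-sequence probabilities used elsewhere in this section.
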

A. Learning Criterion

Bayesian  learning  of  the  RPR  is  based  on  the  experiences  collected  from  the  agent-environment 
interaction.  Assuming  the  interaction  is  episodic,  i.e.,  it  breaks  into  subsequences  called  episodes  [7], 
we  represent  the  experiences  by  a  set  of  episodes. 

Definition 2.2: An episode is a sequence of agent-environment interactions terminated in an absorbing
state that transits to itself with zero reward. An episode is denoted by $(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)$, where
the subscripts are discrete times, $k$ indexes the episodes, and $o$, $a$, and $r$ are respectively observations,
actions, and immediate rewards.

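Definition 2.3 (The RPR Optimality Criterion): Let $D^{(K)} = \{(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)\}_{k=1}^{K}$ be a set
of episodes obtained by an agent interacting with the environment by following policy $\Pi$ to select actions,
where $\Pi$ is an arbitrary stochastic policy with action-selecting distributions $p^{\Pi}(a_t \mid h_t) > 0$, $\forall$ action $a_t$, $\forall$
history $h_t$. The RPR optimality criterion is defined as

$$\hat{V}(D^{(K)}; \Theta) \overset{\text{def}}{=} \frac{1}{K} \sum_{k=1}^{K} \sum_{t=0}^{T_k} \frac{\gamma^t r_t^k}{\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k \mid h_\tau^k)} \prod_{\tau=0}^{t} p(a_\tau^k \mid h_\tau^k, \Theta) \tag{2}$$

where $h_t^k = a_0^k o_1^k a_1^k \cdots o_t^k$ is the history of actions and observations up to time $t$ in the $k$-th episode,
$0 < \gamma < 1$ is the discount, and $\Theta$ denotes the RPR parameters.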

Throughout the report, we call $\hat{V}(D^{(K)}; \Theta)$ the empirical value function of $\Theta$. It is proven in [6] that
$\lim_{K \to \infty} \hat{V}(D^{(K)}; \Theta)$ is the expected sum of discounted rewards by following the RPR policy parameterized
by $\Theta$ for an infinite number of steps. Therefore, the RPR resulting from maximization of $\hat{V}(D^{(K)}; \Theta)$
approaches the optimal as $K$ becomes large (assuming $|Z|$ is appropriate). In the Bayesian setting discussed
below, we use a noninformative prior for $\Theta$, leading to a posterior of $\Theta$ peaked at the optimal RPR;
therefore, the agent is guaranteed to sample the optimal or a near-optimal policy with overwhelming
probability.
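
The empirical value function in (2) is an importance-weighted average of discounted rewards, which can be accumulated in a single pass per episode. The sketch below illustrates this under an assumed data layout (each time step supplies the reward and the two action probabilities); the name `empirical_value` and the tuple format are hypothetical.

```python
def empirical_value(episodes, gamma):
    """Importance-weighted empirical value in the form of Eq. (2).

    episodes : list of episodes; each episode is a list of per-step tuples
               (reward, p_behavior, p_rpr), where
               p_behavior = p^Pi(a_t | h_t) > 0 (policy that collected the data)
               p_rpr      = p(a_t | h_t, Theta)  (RPR being evaluated)
    gamma    : discount factor, 0 < gamma < 1
    """
    total = 0.0
    for episode in episodes:
        ratio = 1.0  # running product over tau = 0..t of p_rpr / p_behavior
        for t, (reward, p_behavior, p_rpr) in enumerate(episode):
            ratio *= p_rpr / p_behavior
            total += (gamma ** t) * reward * ratio
    return total / len(episodes)
```

When the evaluated RPR coincides with the behavior policy, every ratio equals one and the function reduces to the average discounted return of the collected episodes.
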


B.  Bayesian  Learning 

Let $G_0(\Theta)$ represent the prior distribution of the RPR parameters. We define the posterior of $\Theta$ as

$$p(\Theta \mid D^{(K)}, G_0) \overset{\text{def}}{=} \hat{V}(D^{(K)}; \Theta)\, G_0(\Theta)\, [\hat{V}(D^{(K)})]^{-1} \tag{3}$$

where $\hat{V}(D^{(K)}) = \int \hat{V}(D^{(K)}; \Theta)\, G_0(\Theta)\, d\Theta$ is the marginal empirical value. Note that $\hat{V}(D^{(K)}; \Theta)$ is an
empirical value function, thus (3) is a non-standard use of Bayes rule. However, (3) indeed gives a
distribution whose shape incorporates both the prior and the empirical information.

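Since each term in $\hat{V}(D^{(K)}; \Theta)$ is a product of multinomial distributions, it is natural to choose the
prior as a product of Dirichlet distributions,

$$G_0(\Theta) = p(\mu \mid \upsilon)\, p(\pi \mid \rho)\, p(W \mid \omega) \tag{4}$$

where $p(\mu \mid \upsilon) = \mathrm{Dir}(\mu(1), \ldots, \mu(|Z|) \mid \upsilon)$, $p(\pi \mid \rho) = \prod_{i=1}^{|Z|} \mathrm{Dir}(\pi(i,1), \ldots, \pi(i,|A|) \mid \rho_i)$,
$p(W \mid \omega) = \prod_{a=1}^{|A|} \prod_{o=1}^{|O|} \prod_{i=1}^{|Z|} \mathrm{Dir}(W(i,a,o,1), \ldots, W(i,a,o,|Z|) \mid \omega_{i,a,o})$; $\upsilon = \{\upsilon_i\}_{i=1}^{|Z|}$, $\rho_i = \{\rho_{i,m}\}_{m=1}^{|A|}$,
and $\omega_{i,a,o} = \{\omega_{i,a,o,j}\}_{j=1}^{|Z|}$ are hyper-parameters. With the prior thus chosen, the posterior in (3) is a
large mixture of Dirichlet products, and therefore posterior analysis by Gibbs sampling is inefficient.
To overcome this, we employ the variational Bayesian technique [8] to obtain a variational posterior by
maximizing a lower bound to $\ln \int \hat{V}(D^{(K)}; \Theta)\, G_0(\Theta)\, d\Theta$,

$$\mathrm{LB}(\{q_t^k\}, g(\Theta)) = \ln \int \hat{V}(D^{(K)}; \Theta)\, G_0(\Theta)\, d\Theta - \mathrm{KL}\big(\{q_t^k(z_{0:t}^k)\, g(\Theta)\} \,\big\|\, \{\nu_t^k\, p(z_{0:t}^k, \Theta \mid a_{0:t}^k, o_{1:t}^k)\}\big)$$

where $\{q_t^k\}$, $g(\Theta)$ are variational distributions satisfying $q_t^k(z_{0:t}^k) \ge 0$, $g(\Theta) \ge 0$, $\int g(\Theta)\, d\Theta = 1$, and
$\frac{1}{K} \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{z_0^k, \ldots, z_t^k = 1}^{|Z|} q_t^k(z_{0:t}^k) = 1$; $\nu_t^k = \gamma^t r_t^k\, p(a_{0:t}^k \mid o_{1:t}^k) \big[\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k \mid h_\tau^k)\, \hat{V}(D^{(K)})\big]^{-1}$, with $p(a_{0:t}^k \mid o_{1:t}^k) = \int p(a_{0:t}^k \mid o_{1:t}^k, \Theta)\, G_0(\Theta)\, d\Theta$; and $\mathrm{KL}(q \| p)$ denotes the Kullback-Leibler
(KL) distance between probability measures $q$ and $p$.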
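
The factorized form $\{q_t^k(z_{0:t}^k)\, g(\Theta)\}$ represents an approximation of the weighted joint posterior of $\Theta$
and the $z$'s when the lower bound reaches the maximum, and the corresponding $g(\Theta)$ is called the variational
approximate posterior of $\Theta$. The lower bound maximization is accomplished by solving for $\{q_t^k(z_{0:t}^k)\}$ and
$g(\Theta)$ alternately, keeping one fixed while solving for the other. The solutions are summarized in Theorem
2.4; the proof is in [6].

Theorem 2.4: Given the initialization $\hat{\rho} = \rho$, $\hat{\upsilon} = \upsilon$, $\hat{\omega} = \omega$, iterative application of the following
updates produces a sequence of monotonically increasing lower bounds $\mathrm{LB}(\{q_t^k\}, g(\Theta))$, which converges
to a maximum. The update of $\{q_t^k\}$ is

$$q_t^k(z_{0:t}^k) = \sigma_t^k\, p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde{\Theta})$$

where $\tilde{\Theta} = \{\tilde{\pi}, \tilde{\mu}, \tilde{W}\}$ is a set of under-normalized probability mass functions, with
$\tilde{\pi}(i,m) = e^{\psi(\hat{\rho}_{i,m}) - \psi(\sum_{m=1}^{|A|} \hat{\rho}_{i,m})}$, $\tilde{\mu}(i) = e^{\psi(\hat{\upsilon}_i) - \psi(\sum_{i=1}^{|Z|} \hat{\upsilon}_i)}$, and $\tilde{W}(i,a,o,j) = e^{\psi(\hat{\omega}_{i,a,o,j}) - \psi(\sum_{j=1}^{|Z|} \hat{\omega}_{i,a,o,j})}$, and
$\psi$ is the digamma function. The $g(\Theta)$ has the same form as the prior $G_0$ in (4), except that the
hyper-parameters are updated as

$$\hat{\upsilon}_i = \upsilon_i + \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sigma_t^k\, \phi_{t,0}^k(i)$$

$$\hat{\rho}_{i,a} = \rho_{i,a} + \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=0}^{t} \sigma_t^k\, \phi_{t,\tau}^k(i)\, \delta(a_\tau^k, a)$$

$$\hat{\omega}_{i,a,o,j} = \omega_{i,a,o,j} + \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=1}^{t} \sigma_t^k\, \xi_{t,\tau-1}^k(i,j)\, \delta(a_{\tau-1}^k, a)\, \delta(o_\tau^k, o)$$

where $\xi_{t,\tau}^k(i,j) = p(z_\tau^k = i, z_{\tau+1}^k = j \mid a_{0:t}^k, o_{1:t}^k, \tilde{\Theta})$, $\phi_{t,\tau}^k(i) = p(z_\tau^k = i \mid a_{0:t}^k, o_{1:t}^k, \tilde{\Theta})$, and

$$\sigma_t^k = \big[\gamma^t r_t^k\, p(a_{0:t}^k \mid o_{1:t}^k, \tilde{\Theta})\big] \Big[\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k \mid h_\tau^k)\, \hat{V}(D^{(K)}; \tilde{\Theta})\Big]^{-1} \tag{5}$$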
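
The under-normalized probability mass functions in Theorem 2.4 are exponentiated differences of digamma values, as is standard for variational updates with Dirichlet factors. The sketch below shows one way to compute them with the Python standard library only; the helper names `digamma` and `under_normalized` are hypothetical, and the digamma routine is a generic recurrence-plus-asymptotic-series approximation rather than anything prescribed by the report.

```python
import math

def digamma(x):
    """Approximate the digamma function psi(x) for x > 0."""
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += (math.log(x) - 0.5 / x
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result

def under_normalized(hyper):
    """Map Dirichlet hyper-parameters to exp(psi(h_m) - psi(sum_m h_m))."""
    total = digamma(sum(hyper))
    return [math.exp(digamma(h) - total) for h in hyper]
```

For example, `under_normalized([1.0, 1.0])` returns two values of about 0.368 each, whose sum is below one, which is what "under-normalized" refers to.
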


III.  Dual-RPR 

Assume that the agent uses the RPR described in Section II to govern its behavior in the unknown
POMDP environment (the primary policy). Bayesian learning employs the empirical value function
$\hat{V}(D^{(K)}; \Theta)$ in (2) in place of a likelihood function, to obtain the posterior of the RPR parameters $\Theta$. The
episodes $D^{(K)}$ may be obtained from the environment by following an arbitrary stochastic policy $\Pi$ with
$p^{\Pi}(a \mid h) > 0$, $\forall a$, $\forall h$. Although any such $\Pi$ guarantees optimality of the resulting RPR, the choice of $\Pi$
affects the convergence speed. A good choice of $\Pi$ avoids episodes that do not bring new information to
improve the RPR, and thus the agent does not have to see all possible episodes before the RPR becomes
optimal.

In batch learning, all episodes are collected before the learning begins, and thus $\Pi$ is pre-chosen and
does not change during the learning [6]. In online learning, however, the episodes are collected during the
learning,  and  the  RPR  is  updated  upon  completion  of  each  episode.  Therefore  there  is  a  chance  to  exploit 
the  RPR  to  avoid  repeated  learning  in  the  same  part  of  the  environment.  The  agent  should  recognize  belief 
regions  it  is  familiar  with,  and  exploit  the  existing  RPR  policy  there;  in  belief  regions  inferred  as  new, 
the  agent  should  explore.  This  balance  between  exploration  and  exploitation  is  performed  with  the  goal 
of  accumulating  a  large  long-run  reward. 

We consider online learning of the RPR (as the primary policy) and choose $\Pi$ as a mixture of two
policies: one is the current RPR $\Theta$ (exploitation) and the other is an exploration policy $\Pi_e$. This gives the
action-choosing probability $p^{\Pi}(a \mid h) = p(y = 0 \mid h)\, p(a \mid h, \Theta, y = 0) + p(y = 1 \mid h)\, p(a \mid h, \Pi_e, y = 1)$, where
$y = 0$ ($y = 1$) indicates exploitation (exploration). The problem of choosing a good $\Pi$ then reduces to a
proper balance between exploitation and exploration: the agent should exploit $\Theta$ when doing so is highly
rewarding, while following $\Pi_e$ to enhance experience and improve $\Theta$.
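
The mixture form of the behavior policy translates directly into a two-stage sampling procedure: first draw the switch $y$, then draw an action from the selected policy. A minimal sketch follows; the function names and the list-based representation of the two action distributions are assumptions for illustration.

```python
import random

def mixture_action_probs(p_exploit, probs_primary, probs_explore):
    """p^Pi(a|h) = p(y=0|h) p(a|h,Theta,y=0) + p(y=1|h) p(a|h,Pi_e,y=1)."""
    p_explore = 1.0 - p_exploit
    return [p_exploit * pr + p_explore * pe
            for pr, pe in zip(probs_primary, probs_explore)]

def sample_action(p_exploit, probs_primary, probs_explore, rng=random):
    """Draw y first, then an action from the chosen policy; returns (y, action)."""
    y = 0 if rng.random() < p_exploit else 1
    probs = probs_primary if y == 0 else probs_explore
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return y, action
```

The value returned by `mixture_action_probs` is the denominator term $p^{\Pi}(a \mid h)$ needed when the resulting episodes are later reweighted in the empirical value function.
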

An auxiliary RPR is employed to represent the policy for balancing exploration and exploitation, i.e., the
history-dependent distribution $p(y \mid h)$. The auxiliary RPR shares the parameters $\{\mu, W\}$ with the primary
RPR, but with $\pi = \{\pi(z, a) : a \in A, z \in Z\}$ replaced by $\lambda = \{\lambda(z, y) : y = 0 \text{ or } 1, z \in Z\}$, where
$\lambda(z, y)$ is the probability of choosing exploitation ($y = 0$) or exploration ($y = 1$) in belief region $z$. Let
$\lambda$ have the prior

$$p(\lambda \mid u) = \prod_{i=1}^{|Z|} \mathrm{Beta}\big(\lambda(i,0), \lambda(i,1) \mid u_0, u_1\big). \tag{6}$$

In order to encourage exploration when the agent has little experience, we choose $u_0 = 1$ and $u_1 > 1$ so
that, at the beginning of learning, the auxiliary RPR always suggests exploration. As the agent accumulates
episodes of experience, it comes to know a certain part of the environment in which the episodes have
been collected. This knowledge is reflected in the auxiliary RPR, which, along with the primary RPR, is
updated upon completion of each new episode.

Since the environment is a POMDP, the agent's knowledge should be represented in the space of
belief states. However, the agent cannot directly access the belief states, because computation of belief
states requires knowing the true POMDP model, which is not available. Fortunately, the RPR formulation
provides a compact representation of $H = \{h\}$, the space of histories, where each history $h$ corresponds
to a belief state in the POMDP. Within the RPR formulation, $H$ is represented internally as the set of
distributions over belief regions $z \in Z$, which allows the agent to access $H$ based on a subset of samples
from $H$. Let $H_{\text{known}}$ be the part of $H$ that has become known to the agent, i.e., the primary RPR is
optimal in $H_{\text{known}}$ and thus the agent should begin to exploit upon entering $H_{\text{known}}$. As will be clear
below, $H_{\text{known}}$ can be identified by $H_{\text{known}} = \{h : p(y = 0 \mid h, \Theta, \lambda) \approx 1\}$, if the posterior of $\lambda$ is updated
by

$$\hat{u}_{i,0} = u_0 + \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=0}^{t} \sigma_t^k\, \phi_{t,\tau}^k(i) \tag{7}$$

$$\hat{u}_{i,1} = \max\Big(\eta,\; u_1 - \sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{\tau=0}^{t} y_\tau^k\, c\, \phi_{t,\tau}^k(i)\Big), \tag{8}$$

where $\eta$ is a small positive number, and $\sigma_t^k$ is the same as in (5) except that $r_t^k$ is replaced by $m_t^k$, the
meta-reward received at $t$ in episode $k$. We have $m_t^k = r_{\text{meta}}$ if the goal is reached at time $t$ in episode
$k$, and $m_t^k = 0$ otherwise, where $r_{\text{meta}} > 0$ is a constant. When $\Pi_e$ is provided by an oracle (active
learning), a query cost $c > 0$ is taken into account in (8), by subtracting $c$ from $u_1$. Thus, the probability
of exploration is reduced each time the agent makes a query to the oracle (i.e., $y_t^k = 1$). After a certain
number of queries, $\hat{u}_{i,1}$ becomes the small positive number $\eta$ (it never becomes zero due to the max
operator), at which point the agent stops querying in belief region $z = i$.
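
The effect of the updates (7) and (8) on the tendency to explore can be illustrated per belief region. The sketch below is a simplified illustration, not the report's algorithm: it assumes the weighted sums in (7) and (8) have already been accumulated into the hypothetical inputs `exploit_credit` and `query_weight`, and it summarizes the Beta posterior by its mean, which is also an assumption.

```python
def update_lambda_hyper(u0, u1, exploit_credit, query_weight, cost, eta=0.001):
    """Per-region hyper-parameter update in the spirit of (7) and (8).

    exploit_credit[i] : accumulated meta-reward-weighted credit for region i
    query_weight[i]   : accumulated responsibility-weighted count of
                        exploration steps (y = 1) attributed to region i
    cost              : query cost c (0 when exploration is random)
    eta               : small positive floor that keeps u1_hat above zero
    """
    u0_hat = [u0 + credit for credit in exploit_credit]
    u1_hat = [max(eta, u1 - cost * w) for w in query_weight]
    return u0_hat, u1_hat

def mean_explore_prob(u0_hat, u1_hat):
    """Mean probability of exploration per region, u1_hat / (u0_hat + u1_hat)."""
    return [b / (a + b) for a, b in zip(u0_hat, u1_hat)]
```

With `u0 = 1` and `u1 = 2`, an unvisited region keeps a mean exploration probability of 2/3, while a region with large accumulated credit or many costly queries is driven toward exploitation.
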

In (7) and (8), exploitation always receives a "credit", while exploration never receives credit (exploration
is actually discredited when $\Pi_e$ is an oracle). This update makes sure that the chance of exploitation
monotonically increases as the episodes accumulate. Exploration receives no credit because it has been
pre-assigned a credit ($u_1$) in the prior, and the chance of exploration should monotonically decrease with
the accumulation of episodes. The parameter $u_1$ represents the agent's prior for the amount of needed
exploration. When $c > 0$, $u_1$ is discredited by the cost and the agent needs a larger $u_1$ (than when $c = 0$)
to obtain the same amount of exploration. The fact that the amount of exploration monotonically increases
with $u_1$ implies that one can always find a large enough $u_1$ to ensure that the primary RPR is optimal in
$H_{\text{known}} = \{h : p(y = 0 \mid h, \Theta, \lambda) \approx 1\}$. However, an unnecessarily large $u_1$ makes the agent over-explore
and leads to slow convergence. Let $u_1^{\min}$ denote the minimum $u_1$ that ensures optimality in $H_{\text{known}}$. We
assume $u_1^{\min}$ exists in the analysis below. The possible range of $u_1^{\min}$ is examined in the experiments.


IV.  Optimality  and  Convergence  Analysis 


Let $M$ be the true POMDP model. We first introduce an equivalent expression for the empirical value
function in (2),

$$\hat{V}(\mathcal{E}_T^{(K)}; \Theta) = \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t r_t\, p(a_{0:t}, o_{1:t}, r_t \mid y_{0:t} = 0, \Theta, M), \tag{9}$$

where the first summation is over all elements in $\mathcal{E}_T^{(K)} \subseteq \mathcal{E}_T$, and $\mathcal{E}_T = \{(a_{0:T}, o_{1:T}, r_{0:T}) : a_t \in A, o_t \in
O, t = 0, 1, \cdots, T\}$ is the complete set of episodes of length $T$ in the POMDP, with no repeated elements.
The condition $y_{0:t} = 0$, which is an abbreviation for $y_\tau = 0$ $\forall\, \tau = 0, 1, \cdots, t$, indicates that the agent
always follows the RPR ($\Theta$) here. Note that $\hat{V}(\mathcal{E}_T^{(K)}; \Theta)$ is the empirical value function of $\Theta$ defined on $\mathcal{E}_T^{(K)}$,
as is $\hat{V}(D^{(K)}; \Theta)$ on $D^{(K)}$. When $T = \infty$,¹ the two are identical up to a difference in acquiring the
episodes: $\mathcal{E}_T^{(K)}$ is a simple enumeration of distinct episodes while $D^{(K)}$ may contain identical episodes.
The multiplicity of an episode in $D^{(K)}$ results from the sampling process (by following a policy to interact
with the environment). Note that the empirical value function defined using $\mathcal{E}_T^{(K)}$ is interesting only for
theoretical analysis, because the evaluation requires knowing the true POMDP model, not available in
practice. We define the optimistic value function

$$\hat{V}_f(\mathcal{E}_T^{(K)}; \Theta, \lambda, \Pi_e) = \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \sum_{y_0, \ldots, y_t = 0}^{1} \gamma^t \big(r_t + (R_{\max} - r_t) \vee_{\tau=0}^{t} y_\tau\big)\, p(a_{0:t}, o_{1:t}, r_t, y_{0:t} \mid \Theta, \lambda, M, \Pi_e) \tag{10}$$

where $\vee_{\tau=0}^{t} y_\tau$ indicates that the agent receives $r_t$ if and only if $y_\tau = 0$ at all time steps $\tau = 0, 1, \cdots, t$;
otherwise, it receives $R_{\max}$ at $t$, which is an upper bound of the rewards in the environment. Similarly
we can define $\hat{V}_f(D^{(K)}; \Theta, \lambda, \Pi_e)$, the equivalent expression for $\hat{V}_f(\mathcal{E}_T^{(K)}; \Theta, \lambda, \Pi_e)$. The following lemma
is proven in the Appendix.

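Lemma 4.1: Let $\hat{V}(\mathcal{E}_T^{(K)}; \Theta)$, $\hat{V}_f(\mathcal{E}_T^{(K)}; \Theta, \lambda, \Pi_e)$, and $R_{\max}$ be defined as above. Let $P_{\text{explore}}(\mathcal{E}_T^{(K)}, \Theta, \lambda, \Pi_e)$
be the probability of executing the exploration policy $\Pi_e$ at least once in some episode in $\mathcal{E}_T^{(K)}$, under the
auxiliary RPR $(\Theta, \lambda)$ and the exploration policy $\Pi_e$. Then

$$P_{\text{explore}}(\mathcal{E}_T^{(K)}, \Theta, \lambda, \Pi_e) \ge \frac{1 - \gamma}{R_{\max}} \big| \hat{V}(\mathcal{E}_T^{(K)}; \Theta) - \hat{V}_f(\mathcal{E}_T^{(K)}; \Theta, \lambda, \Pi_e) \big|.$$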


Proposition 4.2: Let $\Theta$ be the optimal RPR on $\mathcal{E}_\infty^{(K)}$ and $\Theta^*$ be the optimal RPR in the complete
POMDP environment. Let the auxiliary RPR hyper-parameters ($\lambda$) be updated according to (7) and (8),
with $u_1 \ge u_1^{\min}$. Let $\Pi_e$ be the exploration policy and $\epsilon > 0$. Then either (a) $\hat{V}(\mathcal{E}_\infty; \Theta) \ge \hat{V}(\mathcal{E}_\infty; \Theta^*) - \epsilon$,
or (b) the probability that the auxiliary RPR suggests executing $\Pi_e$ in some episode unseen in $\mathcal{E}_\infty^{(K)}$ is at
least $\epsilon (1 - \gamma) / R_{\max}$.

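Proof: It is sufficient to show that if (a) does not hold, then (b) must hold. Let us assume $\hat{V}(\mathcal{E}_\infty; \Theta) <
\hat{V}(\mathcal{E}_\infty; \Theta^*) - \epsilon$. Because $\Theta$ is optimal in $\mathcal{E}_\infty^{(K)}$, $\hat{V}(\mathcal{E}_\infty^{(K)}; \Theta) \ge \hat{V}(\mathcal{E}_\infty^{(K)}; \Theta^*)$, which implies $\hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta) <
\hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta^*) - \epsilon$, where $\mathcal{E}_\infty^{(\backslash K)} = \mathcal{E}_\infty \setminus \mathcal{E}_\infty^{(K)}$. We show below that $\hat{V}_f(\mathcal{E}_\infty^{(\backslash K)}; \Theta, \lambda, \Pi_e) \ge \hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta^*)$
which, together with Lemma 4.1, implies

$$\begin{aligned}
P_{\text{explore}}(\mathcal{E}_\infty^{(\backslash K)}, \Theta, \lambda, \Pi_e) &\ge \frac{1 - \gamma}{R_{\max}} \big[ \hat{V}_f(\mathcal{E}_\infty^{(\backslash K)}; \Theta, \lambda, \Pi_e) - \hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta) \big] \\
&\ge \frac{1 - \gamma}{R_{\max}} \big[ \hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta^*) - \hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta) \big] \\
&> \frac{\epsilon (1 - \gamma)}{R_{\max}}.
\end{aligned}$$

We now show $\hat{V}_f(\mathcal{E}_\infty^{(\backslash K)}; \Theta, \lambda, \Pi_e) \ge \hat{V}(\mathcal{E}_\infty^{(\backslash K)}; \Theta^*)$. By construction, $\hat{V}_f(\mathcal{E}_\infty^{(\backslash K)}; \Theta, \lambda, \Pi_e)$ is an optimistic
value function, in which the agent receives $R_{\max}$ at any time $t$ unless $y_\tau = 0$ at $\tau = 0, 1, \cdots, t$. However,
$y_\tau = 0$ at $\tau = 0, 1, \cdots, t$ implies that $\{h_\tau : \tau = 0, 1, \cdots, t\} \subset H_{\text{known}}$. By the premise, $\lambda$ is updated
according to (7) and (8) and $u_1 \ge u_1^{\min}$, therefore $\Theta$ is optimal in $H_{\text{known}}$ (see the discussions following
(7) and (8)), which implies $\Theta$ is optimal in $\{h_\tau : \tau = 0, 1, \cdots, t\}$. Thus, the inequality holds. Q.E.D.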

Proposition 4.2 shows that whenever the primary RPR achieves less accumulative reward than the
optimal RPR by $\epsilon$, the auxiliary RPR suggests exploration with a probability exceeding $\epsilon (1 - \gamma) R_{\max}^{-1}$.
Conversely, whenever the auxiliary RPR suggests exploration with a probability smaller than $\epsilon (1 - \gamma) R_{\max}^{-1}$,
the primary RPR achieves $\epsilon$-near optimality. This ensures that the agent is either receiving sufficient
rewards or it is performing sufficient exploration.
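
To get a feel for the size of the threshold in Proposition 4.2, it can be evaluated for concrete numbers. The sketch below does this; the function name and the example values of the sub-optimality gap, discount, and reward bound are hypothetical and are not taken from the experiments.

```python
def exploration_threshold(epsilon, gamma, r_max):
    """Return epsilon * (1 - gamma) / R_max, the threshold in Proposition 4.2."""
    if not (0.0 < gamma < 1.0) or r_max <= 0.0 or epsilon <= 0.0:
        raise ValueError("require 0 < gamma < 1, r_max > 0, epsilon > 0")
    return epsilon * (1.0 - gamma) / r_max

# Hypothetical values: gap of 1 reward unit, discount 0.95, reward bound 10.
threshold = exploration_threshold(1.0, 0.95, 10.0)
```

With these illustrative values the threshold is about 0.005, so an exploration probability below that level would imply that the primary RPR is within one reward unit of optimal.
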


¹ An episode almost always terminates in a finite number of time steps in practice, and the agent stays in the absorbing state with zero reward for the
remaining infinite steps after an episode is terminated [7]. The infinite horizon is only to ensure theoretically that all episodes have the same
horizon length.

V. Experimental Results

Our experiments are based on Shuttle, a benchmark POMDP problem [9], with the following setup.
The primary policy is an RPR with $|Z| = 10$ and a prior in (4), with all hyper-parameters initially set to
one (which makes the initial prior non-informative). The auxiliary policy is an RPR sharing $\{\mu, W\}$ with
the primary RPR and having a prior for $\lambda$ as in (6). The prior of $\lambda$ is initially biased towards exploration
by using $u_0 = 1$ and $u_1 > 1$. We consider various values of $u_1$ to examine the different effects. The
agent performs online learning: upon termination of each new episode, the primary and auxiliary RPR
posteriors are updated by using the previous posteriors as the current priors. The primary RPR update
follows Theorem 2.4 with $K = 1$ while the auxiliary RPR update follows (7) and (8) for $\lambda$ (it shares the
same update with the primary RPR for $\mu$ and $W$). We perform 100 independent Monte Carlo runs. In
each run, the agent starts learning from a random position in the environment and stops learning when
$K_{\text{total}}$ episodes are completed. We compare various methods that the agent uses to balance exploration
and exploitation: (i) following the auxiliary RPR, with various values of $u_1$, to adaptively switch between
exploration and exploitation; (ii) randomly switching between exploration and exploitation with a fixed
exploration rate $P_{\text{explore}}$ (various values of $P_{\text{explore}}$ are examined). When performing exploitation, the
agent follows the current primary RPR (using the $\Theta$ that maximizes the posterior); when performing
exploration, it follows an exploration policy $\Pi_e$. We consider two types of $\Pi_e$: (i) taking random actions
and (ii) following the policy obtained by solving the true POMDP using PBVI [10] with 2000 belief
samples. In either case, $r_{\text{meta}} = 1$ and $\eta = 0.001$. In case (ii), the PBVI policy is the oracle and incurs a
query cost $c$.

We report: (i) the sum of discounted rewards accrued within each episode during learning; these rewards
result from both exploitation and exploration; (ii) the quality of the primary RPR upon termination of each
learning episode, represented by the sum of discounted rewards averaged over 251 episodes of following
the primary RPR (using the standard testing procedure for Shuttle: each episode is terminated when either
the goal is reached or a maximum of 251 steps is taken); these rewards result from exploitation alone;
(iii) the exploration rate $P_{\text{explore}}$ in each learning episode, which is the number of time steps at which
exploration is performed divided by the total number of time steps in a given episode. In order to examine the
optimality, the rewards in (i)-(ii) have the corresponding optimal rewards subtracted, where the optimal
rewards are obtained by following the PBVI policy; the differences are reported, with zero difference
indicating optimality and negative difference indicating sub-optimality. All results are averaged over the 100
Monte Carlo runs. The results are summarized in Figure 1 when $\Pi_e$ takes random actions and in Figure
2 when $\Pi_e$ is an oracle (the PBVI policy).
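
The per-episode quantities reported above are simple to compute from logged rewards and exploration flags. The following sketch shows one possible way to do so; the function names and the list-based episode logs are assumptions for illustration and do not reproduce the evaluation code used in the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def exploration_rate(y_flags):
    """Fraction of time steps with y = 1 (exploration) in one episode."""
    return sum(y_flags) / len(y_flags)

def reward_gap(rewards, optimal_rewards, gamma):
    """Discounted return minus the optimal one.

    Zero indicates optimality; a negative value indicates sub-optimality.
    """
    return (discounted_return(rewards, gamma)
            - discounted_return(optimal_rewards, gamma))
```

Averaging `reward_gap` and `exploration_rate` over independent runs, episode by episode, gives curves of the kind described for the left, middle, and right panels of the figures.
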

Fig. 1. Results on Shuttle with a random exploration policy, with $K_{\text{total}} = 3000$. Left: accumulative discounted reward
accrued within each learning episode, with the corresponding optimal reward subtracted. Middle: accumulative discounted
rewards averaged over 251 episodes of following the primary RPR obtained after each learning episode, again with the
corresponding optimal reward subtracted. Right: the rate of exploration in each learning episode. All results are averaged over
100 independent Monte Carlo runs.

Fig. 2. Results on Shuttle with an oracle exploration policy incurring cost $c = 1$ (top row) and $c = 3$ (bottom row), and
$K_{\text{total}} = 100$. Each figure in a row is a counterpart of the corresponding figure in Figure 1, with the random $\Pi_e$ replaced by
the oracle $\Pi_e$. See the captions there for details.

It is seen from Figure 1 that, with random exploration and $u_1 = 2$, the primary policy converges to
optimality and, accordingly, $P_{\text{explore}}$ drops to zero, after about 1500 learning episodes. When $u_1$ increases
to 20, the convergence is slower: it does not occur (and $P_{\text{explore}} > 0$) until after around 2500 learning
episodes. With $u_1$ increased to 200, the convergence does not happen and $P_{\text{explore}} > 0.2$ within the first
3000 learning episodes. These results verify our analysis in Sections III and IV: (i) the primary policy
improves as $P_{\text{explore}}$ decreases; (ii) the agent explores when it is not acting optimally and it is acting
optimally when it stops exploring; (iii) there exists a finite $u_1$ such that the primary policy is optimal if
$P_{\text{explore}} = 0$. Although $u_1 = 2$ may still be larger than $u_1^{\min}$, it is small enough to ensure convergence
within 1500 episodes. We also observe from Figure 1 that: (i) the agent explores more efficiently when it
is adaptively switched between exploration and exploitation by the auxiliary policy, than when the switch
is random; (ii) the primary policy cannot converge to optimality when the agent never explores; (iii) the
primary policy may converge to optimality when the agent always takes random actions, but it may need
infinite learning episodes to converge.

The results in Figure 2, with $\Pi_e$ being an oracle, lead to conclusions similar to those in Figure 1, where
$\Pi_e$ is random. However, there are two special observations from Figure 2: (i) $P_{\mathrm{explore}}$ is affected by the
query cost c: with a larger c, the agent performs less exploration; (ii) the convergence rate of the primary
policy is not significantly affected by the query cost. The reason for (ii) is that the oracle always provides
optimal actions, so over-exploration does not harm optimality; as long as the agent takes optimal
actions, the primary policy continually improves if it is not yet optimal, or it remains optimal if it is
already optimal.


VI.  Summary  of  Balancing  Exploration  &  Exploitation 

We have presented a dual-policy approach for jointly learning the agent behavior and the optimal balance
between exploitation and exploration, assuming the unknown environment is a POMDP. By identifying a
known part of the environment in terms of histories (parameterized by the RPR), the approach adaptively
switches between exploration and exploitation depending on whether the agent is in the known part.
We have provided theoretical guarantees for the agent to either explore efficiently or exploit efficiently.
Experimental results show good agreement with our theoretical analysis and that our approach finds the
optimal policy efficiently. Although we empirically demonstrated the existence of a small $u_1$ to ensure
efficient convergence to optimality, further theoretical analysis is needed to find $u_1^{\min}$, the tight lower bound
of $u_1$, which ensures convergence to optimality with just the right amount of exploration (without
over-exploration). Finding the exact $u_1^{\min}$ is difficult because of the partial observability. However, there is hope
of finding a good approximation to $u_1^{\min}$. In the worst case, the agent can always choose to be optimistic,
as in methods such as Rmax. An optimistic agent uses a large $u_1$, which usually leads to over-exploration but
ensures convergence to optimality.


Appendix 


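Proof of Lemma 4.1: We expand (10) as

$$
\begin{aligned}
\hat V_f(\mathcal{E}_T^{(K)};\Theta,\lambda,\Pi_e) ={}& \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t r_t\, p(a_{0:t},o_{1:t},r_t \mid y_{0:t}=0,\Theta,M)\, p(y_{0:t}=0\mid\Theta,\lambda) \\
&+ \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t R_{\max}\sum_{y_{0:t}\neq 0} p(a_{0:t},o_{1:t},r_t \mid y_{0:t},\Theta,M,\Pi_e)\, p(y_{0:t}\mid\Theta,\lambda)
\end{aligned}
$$

where $y_{0:t}=0$ is an abbreviation for $y_\tau = 0\ \forall\, \tau = 0,\dots,t$ and $y_{0:t}\neq 0$ is an abbreviation for $\exists\, 0\le\tau\le t$ satisfying
$y_\tau \neq 0$. The sum $\sum_{\mathcal{E}_T^{(K)}}$ is over all episodes in $\mathcal{E}_T^{(K)}$. The difference between (9) and (11) is

$$
\begin{aligned}
&\left|\hat V(\mathcal{E}_T^{(K)};\Theta) - \hat V_f(\mathcal{E}_T^{(K)};\Theta,\lambda,\Pi_e)\right| \\
&\quad= \Big|\sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t r_t\, p(a_{0:t},o_{1:t},r_t \mid y_{0:t}=0,\Theta,M)\,\big(1-p(y_{0:t}=0\mid\Theta,\lambda)\big) \\
&\qquad - \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t R_{\max}\sum_{y_{0:t}\neq 0} p(a_{0:t},o_{1:t},r_t \mid y_{0:t},\Theta,M,\Pi_e)\, p(y_{0:t}\mid\Theta,\lambda)\Big| \\
&\quad= \Big|\sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t r_t\, p(a_{0:t},o_{1:t},r_t \mid y_{0:t}=0,\Theta,M)\sum_{y_{0:t}\neq 0} p(y_{0:t}\mid\Theta,\lambda) \\
&\qquad - \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t R_{\max}\sum_{y_{0:t}\neq 0} p(a_{0:t},o_{1:t},r_t \mid y_{0:t},\Theta,M,\Pi_e)\, p(y_{0:t}\mid\Theta,\lambda)\Big| \\
&\quad= \Big|\sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t r_t \sum_{y_{0:t}\neq 0}\Big[p(a_{0:t},o_{1:t},r_t \mid y_{0:t}=0,\Theta,M) - \frac{R_{\max}}{r_t}\, p(a_{0:t},o_{1:t},r_t \mid y_{0:t},\Theta,M,\Pi_e)\Big]\, p(y_{0:t}\mid\Theta,\lambda)\Big| \\
&\quad\le \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t R_{\max}\sum_{y_{0:t}\neq 0} p(y_{0:t}\mid\Theta,\lambda)
 = \sum_{\mathcal{E}_T^{(K)}}\sum_{t=0}^{T}\gamma^t R_{\max}\big(1-p(y_{0:t}=0\mid\Theta,\lambda)\big) \\
&\quad\le \sum_{\mathcal{E}_T^{(K)}}\big(1-p(y_{0:T}=0\mid\Theta,\lambda)\big)\sum_{t=0}^{T}\gamma^t R_{\max}
 = \frac{1-\gamma^{T+1}}{1-\gamma}\,R_{\max}\sum_{\mathcal{E}_T^{(K)}}\big(1-p(y_{0:T}=0\mid\Theta,\lambda)\big)
\end{aligned}
$$

where $\sum_{y_{0:t}\neq 0}$ is a sum over sequences $\{y_{0:t} : \exists\, 0\le\tau\le t \text{ satisfying } y_\tau\neq 0\}$.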


Q.E.D. 


VII.  Networked  POMDPs  and  Sharing  Information 

Reinforcement learning (RL) typically requires a large quantity of trial-and-error searches (data) to
discover the long-term consequences of actions in an unknown dynamic environment [7]. When the
environment is not fully observable, the situation becomes more severe, since the agent needs more data
to reason about the state uncertainty in addition to the consequences of each state-action pair. Therefore,
it is important to utilize as much prior knowledge as possible in reinforcement learning, to promote
parsimony in data usage.

To  be  specific,  let  the  unknown  environment  be  characterized  by  a  partially  observable  Markov  decision 
process  (POMDP),  the  states  of  which  the  agent  infers  through  observations  that  are  probabilistically 
dependent  on  the  states.  This  gives  rise  to  the  belief  state,  a  probability  distribution  of  the  states  conditioned 
on  all  observed  data  up  to  the  moment.  The  belief  state  is  a  sufficient  statistic  summarizing  all  the 
information required to make the decision about the action at any given moment [11]. To compute belief
states,  however,  the  agent  must  assume  complete  knowledge  of  the  environment  (i.e.,  must  know  the 
underlying  POMDP  model),  which  is  not  available  by  the  assumption  of  reinforcement  learning. 

Methods of addressing reinforcement learning in POMDPs are generally divided into two categories:
model-based and model-free [12]. A model-based method first seeks to learn the underlying POMDP
model of the dynamic environment in question, and then applies POMDP planning algorithms to find the
optimal policy. A model-free method directly finds the policy, avoiding the intermediate step of learning
the underlying POMDP.

As pointed out in [12], model-free methods are computationally advantageous, but they cannot take
advantage  of  prior  knowledge  about  the  environment,  as  their  model-based  counterparts  do.  The  latter  is 
true  because  it  is  generally  difficult  to  establish  exact  correspondence  between  a  POMDP  and  its  policy, 
and  therefore  the  knowledge  for  a  specific  POMDP  cannot  easily  be  transferred  into  the  knowledge  for 
its  policy.  The  difficulty,  however,  is  alleviated  in  multi-task  reinforcement  learning  (MTRL)  [13],  in 
which  one  is  interested  in  the  prior  knowledge  that  one  environment  is  similar  to  another.  This  type  of 
relational  knowledge  transfers  readily  from  POMDPs  to  their  optimal  policies,  because  similar  POMDPs 
will  accordingly  have  similar  optimal  policies.  The  key  is  then  to  infer  which  environments  are  similar 
and  how  many  clusters  (classes  of  environments)  are  present.  This  is  accomplished  in  [13]  by  using 
a  nonparametric  Dirichlet  process  (DP)  [14]  prior  imposed  on  the  policies  across  the  environments. 
With  the  experiences  from  multiple  environments,  the  DP  prior  encourages  the  environments  to  form 
appropriate  clusters,  so  that  data  are  shared  within  each  cluster  to  enhance  the  cumulative  information 
and  improve  policy  learning.  It  is  noteworthy  that  the  computational  advantages  of  model-free  methods 
are  magnified  in  the  MTRL  setting,  because  they  avoid  solving  a  POMDP  planning  problem  for  each 
cluster  of  environments,  repeatedly  whenever  the  clusters  are  updated. 

Model-free  methods  rely  on  an  appropriate  way  of  representing  the  policy,  based  directly  on  the  available 
observed  information.  The  MTRL  framework  in  [13]  is  based  on  the  regionalized  policy  representation 
(RPR),  proposed  there  to  yield  an  efficient  parametrization  for  the  conditional  distribution  of  action  choices 
given  historical  actions  and  observations.  The  RPR  is  amenable  to  a  Bayesian  formulation  and  the  Dirichlet 
process  prior  can  be  employed  to  promote  clustering  of  the  RL  tasks. 

A  drawback  of  the  DP  prior  is  that  it  either  encourages  global  clustering  based  on  the  complete  set  of 
parameters,  or  it  encourages  independent  local  clustering  based  on  disjoint  subsets  of  parameters;  however, 
it does not encourage an appropriate balance of both. On one hand, global clustering forces two partly
similar tasks either to share information inappropriately or not to share information at all. On the other
hand,  independent  local  clustering  yields  unnecessary  local  clusters,  increasing  the  burden  on  data  usage. 
In  this  project  we  aim  to  address  this  problem  by  employing  a  nonparametric  dependent  local  partition 
process  (LPP)  [15]  in  place  of  the  DP.  A  major  advantage  arising  from  this  replacement  is  that  the  LPP 
allows  simultaneous  local  and  global  clustering,  and  therefore  it  provides  an  effective  vehicle  for  sharing 
information  between  partially  similar  tasks. 

This  aspect  of  the  project  has  two  major  contributions.  The  first  is  the  proposed  LPP-based  MTRL 
framework,  which  includes  the  DP-based  framework  in  [13]  as  a  special  case,  and  the  associated  learning 
algorithm  and  experimental  studies.  The  second  principal  contribution  is  the  theoretical  analysis  of  the 
LPP,  which  extends  the  results  in  [15]  from  two  subsets  of  the  parameters  to  an  arbitrary  number  of 
subsets.  Our  theoretical  analysis  provides  further  insights  into  the  LPP,  both  in  the  general  sense  and  for 
our  specific  problem.  In  addition,  we  also  provide  analysis  justifying  the  LPP  as  a  relational  prior  for 
model-free  RL. 


VIII.  Regionalized  Policy  Representation 

Definition 8.1: [13] A regionalized policy representation is a tuple $\langle \mathcal{A}, \mathcal{O}, \mathcal{Z}, W, \mu, \pi \rangle$, where $\mathcal{A}$, $\mathcal{O}$,
and $\mathcal{Z}$ are respectively a finite set of actions, observations, and belief regions. $W$ is the belief-region
transition function, with $W(z,a,o',z')$ denoting the probability of transiting from $z$ to $z'$ when taking
action $a$ in $z$ results in observing $o'$. $\mu$ is the initial distribution of belief regions, with $\mu(z)$ denoting
the probability of initially being in $z$. $\pi$ are the region-dependent stochastic policies, with $\pi(z,a)$
denoting the probability of taking action $a$ in $z$.

The history $h_t = \{a_{0:t-1}, o_{1:t}\}$ is a sequence of actions performed and observations received up to $t$.
Let $\Theta = \{\pi, \mu, W\}$ denote the set of RPR parameters. The number of parameters is given by $|\Theta| =
|\mu| + |\pi| + |W| = |\mathcal{Z}| + |\mathcal{A}||\mathcal{Z}| + |\mathcal{A}||\mathcal{O}||\mathcal{Z}|^2$. The RPR expresses the joint probability distribution of $z_{0:t}$
and $a_{0:t}$ as

$$p(a_{0:t}, z_{0:t} \mid o_{1:t}, \Theta) = \mu(z_0)\,\pi(z_0,a_0)\prod_{\tau=1}^{t} W(z_{\tau-1},a_{\tau-1},o_\tau,z_\tau)\,\pi(z_\tau,a_\tau) \tag{11}$$

By marginalizing $z_{0:t}$ out in (11), one obtains $p(a_{0:t} \mid o_{1:t}, \Theta)$, which can be used to yield $p(a_t \mid h_t, \Theta)$.
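
The marginalization over belief regions can be carried out with a forward recursion, in the same way as for a hidden Markov model. The following is a minimal sketch, assuming the RPR parameters are stored as NumPy arrays `mu[z]`, `pi[z, a]` and `W[z, a, o, z_next]`; this array layout and the function names are assumptions made for illustration, not part of the original work.

```python
import numpy as np


def _region_weights(mu, pi, W, actions, observations):
    """Unnormalised weights over z_t given a_{0:t-1} and o_{1:t}, with z_{0:t-1} summed out."""
    pi = np.asarray(pi, dtype=float)
    W = np.asarray(W, dtype=float)
    weights = np.asarray(mu, dtype=float)
    for a, o in zip(actions, observations):
        # Multiply by pi(z, a), then propagate through W(z, a, o, z').
        weights = (weights * pi[:, a]) @ W[:, a, o, :]
    return weights


def action_sequence_prob(mu, pi, W, actions, observations):
    """p(a_{0:t} | o_{1:t}, Theta): the joint distribution with z_{0:t} marginalised out."""
    if len(actions) != len(observations) + 1:
        raise ValueError("expected one more action than observations")
    weights = _region_weights(mu, pi, W, actions[:-1], observations)
    return float(weights @ np.asarray(pi, dtype=float)[:, actions[-1]])


def next_action_prob(mu, pi, W, past_actions, observations, action):
    """p(a_t | h_t, Theta) for the history h_t = (a_{0:t-1}, o_{1:t})."""
    if len(past_actions) != len(observations):
        raise ValueError("a history has equally many past actions and observations")
    weights = _region_weights(mu, pi, W, past_actions, observations)
    return float(weights @ np.asarray(pi, dtype=float)[:, action] / weights.sum())
```

The cost is linear in the history length and quadratic in the number of belief regions, rather than exponential as a naive sum over all region sequences would be.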

Assuming episodic agent-environment interactions [7], the RPR is learned using a set of episodes,
where an episode of length $T$ is denoted by $(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_T^k a_T^k r_T^k)$, with $k$ the episode index.

Definition 8.2: [13] Let $\mathcal{D}^{(K)} = \{(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)\}_{k=1}^{K}$ be a set of episodes obtained by an
agent interacting with the environment by following policy $\Pi$ to select actions, where $\Pi$ is an arbitrary
stochastic policy with action-selecting distributions $p^{\Pi}(a_t \mid h_t) > 0$, $\forall$ action $a_t$, $\forall$ history $h_t$. The empirical
value function is defined as

$$\hat V(\mathcal{D}^{(K)};\Theta) \overset{\mathrm{def}}{=} \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\gamma^t r_t^k\,\frac{\prod_{\tau=0}^{t} p(a_\tau^k \mid h_\tau^k,\Theta)}{\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k \mid h_\tau^k)} \tag{12}$$

where $h_t^k = a_0^k o_1^k a_1^k \cdots o_t^k$ is the history of actions and observations up to time $t$ in the $k$-th episode,
$0 < \gamma < 1$ is the discount, and $\Theta$ denotes the RPR parameters.

It is proven in [13] that, as $K \to \infty$, the limit of $\hat V(\mathcal{D}^{(K)};\Theta)$ is the expected sum of discounted rewards
by following the RPR policy parameterized by $\Theta$ for an infinite number of steps.
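
The empirical value function is an importance-weighted average of discounted rewards. The sketch below assumes that each episode is supplied as a triple of equal-length sequences: the rewards, the per-step probabilities of the taken actions under the RPR, and the per-step probabilities under the behavior policy. This input format and the function name are illustrative assumptions.

```python
import numpy as np


def empirical_value(episodes, gamma):
    """Importance-weighted empirical value of an RPR policy.

    episodes: list of (rewards, p_rpr, p_behavior) triples, one per episode, where
      p_rpr[t]      = probability of the taken action a_t given h_t under the RPR,
      p_behavior[t] = probability of a_t given h_t under the behavior policy (> 0).
    """
    if len(episodes) == 0:
        raise ValueError("at least one episode is required")
    total = 0.0
    for rewards, p_rpr, p_behavior in episodes:
        rewards = np.asarray(rewards, dtype=float)
        # Cumulative ratio prod_{tau <= t} p_rpr[tau] / p_behavior[tau] for every t.
        ratios = np.cumprod(np.asarray(p_rpr, dtype=float) / np.asarray(p_behavior, dtype=float))
        discounts = gamma ** np.arange(len(rewards))
        total += float(np.sum(discounts * rewards * ratios))
    return total / len(episodes)
```

When the RPR and the behavior policy assign the same probabilities, all ratios equal one and the function returns the plain average discounted return.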

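Let $G_0(\Theta)$ represent the prior distribution of the RPR parameters. The posterior of $\Theta$ is defined as

$$p(\Theta \mid \mathcal{D}^{(K)}, G_0) \overset{\mathrm{def}}{=} \frac{\hat V(\mathcal{D}^{(K)};\Theta)\,G_0(\Theta)}{\int \hat V(\mathcal{D}^{(K)};\Theta)\,G_0(\Theta)\,d\Theta} \tag{13}$$

where the denominator, $\hat V(\mathcal{D}^{(K)}) = \int \hat V(\mathcal{D}^{(K)};\Theta)\,G_0(\Theta)\,d\Theta$, is the marginal empirical value. Since each term in $\hat V(\mathcal{D}^{(K)};\Theta)$
is a product of multinomial distributions, it is natural to choose the prior as a product of Dirichlet
distributions,

$$G_0(\Theta) = p(\mu \mid \upsilon)\,p(\pi \mid \rho)\,p(W \mid \omega) \tag{14}$$

where $p(\mu \mid \upsilon) = \mathrm{Dir}(\mu(1),\dots,\mu(|\mathcal{Z}|) \mid \upsilon)$, $p(\pi \mid \rho) = \prod_{i=1}^{|\mathcal{Z}|}\mathrm{Dir}(\pi(i,1),\dots,\pi(i,|\mathcal{A}|) \mid \rho_i)$, $p(W \mid \omega) =
\prod_{a=1}^{|\mathcal{A}|}\prod_{o=1}^{|\mathcal{O}|}\prod_{i=1}^{|\mathcal{Z}|}\mathrm{Dir}(W(i,a,o,1),\dots,W(i,a,o,|\mathcal{Z}|) \mid \omega_{i,a,o})$; $\rho_i = \{\rho_{i,m}\}_{m=1}^{|\mathcal{A}|}$, $\upsilon = \{\upsilon_i\}_{i=1}^{|\mathcal{Z}|}$, and $\omega_{i,a,o} =
\{\omega_{i,a,o,j}\}_{j=1}^{|\mathcal{Z}|}$ are hyper-parameters.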

IX.  Reinforcement  Learning  in  Multiple  Environments 

We consider $M$ environments indexed by $m = 1, 2, \dots, M$, each characterized by an unknown POMDP
with the same action set $\mathcal{A}$ and observation set $\mathcal{O}$. Though the environments may apparently look different
from each other, it is often the case in practice that they fall into clusters such that those in the same cluster
share fundamental common characteristics. Assume that, from each environment $m$, we have collected
a set of episodes denoted as $\mathcal{D}^{(K_m)} = \{(a_0^{m,k} r_0^{m,k} o_1^{m,k} a_1^{m,k} r_1^{m,k} \cdots o_{T_{m,k}}^{m,k} a_{T_{m,k}}^{m,k} r_{T_{m,k}}^{m,k})\}_{k=1}^{K_m}$, where a subscript
or superscript $m$ indicates the environment from which the episodes originated.

One may pursue various paradigms to learn the RPR policies for the $M$ environments. At one extreme,
one may perform single-task reinforcement learning (STRL), i.e., employing $\mathcal{D}^{(K_m)}$ to obtain a distinct
RPR policy for the $m$-th environment, for any $m \in \{1, 2, \dots, M\}$. At the other extreme, one may
aggregate the episodes across the environments to form a pool $\cup_{m=1}^{M}\mathcal{D}^{(K_m)}$, which is then employed to
get one RPR for all environments. Clearly, STRL treats the environments as independent of each other
while pool-based reinforcement learning (PBRL) treats the environments as identical.

Between  the  two  extremes  is  MTRL,  in  which  one  partitions  the  environments  into  clusters  based  on  an 
appropriate  similarity  measure.  Given  the  partition,  one  performs  PBRL  within  each  cluster  or,  equivalently, 
performs  STRL  by  treating  each  cluster  as  a  task.  In  Bayesian  learning,  the  similarity  measure  is  implicitly 
prescribed by the Bayesian prior, which induces probabilistic task clusters. Different Bayesian priors induce
different task clusters. By changing the priors, one obtains a wide spectrum of MTRL algorithms, bridging
the gap between STRL and PBRL.
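
Given a fixed partition of the environments, the three paradigms differ only in how episodes are pooled before learning. The following sketch illustrates this with a hypothetical helper, `pool_by_cluster`, that groups per-environment episode sets by cluster label; in the Bayesian setting the labels are of course inferred rather than given.

```python
from collections import defaultdict


def pool_by_cluster(episode_sets, cluster_labels):
    """Merge the episode sets of all environments that share a cluster label."""
    if len(episode_sets) != len(cluster_labels):
        raise ValueError("need exactly one cluster label per environment")
    pools = defaultdict(list)
    for episodes, label in zip(episode_sets, cluster_labels):
        pools[label].extend(episodes)
    return dict(pools)


# Three environments with placeholder episode identifiers.
data = [["e1", "e2"], ["e3"], ["e4", "e5"]]
strl_pools = pool_by_cluster(data, [0, 1, 2])  # STRL: every environment is its own cluster
pbrl_pools = pool_by_cluster(data, [0, 0, 0])  # PBRL: a single cluster for all environments
mtrl_pools = pool_by_cluster(data, [0, 0, 1])  # MTRL: an intermediate partition
```

A single RPR would then be learned from each pool.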

The  success  of  MTRL  hinges  on  the  choice  of  the  Bayesian  prior.  The  key  is  that  the  similarity 
measure  prescribed  by  the  prior  should  distinguish  the  differences  between  tasks  while  being  able  to  also 
find  similarities  at  the  proper  level  of  fidelity.  In  other  words,  a  good  prior  should  provide  a  reasonable 
balance between capturing the common characteristics among the tasks and respecting the idiosyncrasies
of each individual task. This motivates use of a dependent local partition prior that promotes correlated
local task clusters, allows a flexible similarity pattern that accounts for common as well as idiosyncratic
aspects among the tasks, and thus makes the information sharing more efficient.

X.  Multi-task  Reinforcement  Learning  via  Correlated  Local  Task  Clusters 

Introducing notation, we let $\Theta_m$ denote the RPR parameters for the $m$-th environment, with the number
of belief regions $|\mathcal{Z}|$ independent of $m$. Recalling that the environments have the same $\mathcal{A}$ and $\mathcal{O}$ and that $|\Theta|$
is a function of $(|\mathcal{Z}|, |\mathcal{A}|, |\mathcal{O}|)$, we have $|\Theta_m| = d_{\mathrm{RPR}}$, where $d_{\mathrm{RPR}}$ is constant. We further assume the
elements in $\Theta_m$ follow the same order across $m = 1, 2, \dots, M$. Let $\{1, 2, \dots, d_{\mathrm{RPR}}\}$ be partitioned into
$J$ nonempty disjoint subsets denoted by $\{\mathcal{I}_1, \mathcal{I}_2, \dots, \mathcal{I}_J\}$. Hence $\mathcal{I}_j$ indexes the $j$-th part of an RPR in
any environment. We define $\Theta_{mj}$ as the subset of $\Theta_m$ such that the elements of $\Theta_{mj}$ are indexed by $\mathcal{I}_j$
in $\Theta_m$. Thus $\Theta_m$ is accordingly partitioned into $J$ disjoint nonempty subsets $\{\Theta_{m1}, \dots, \Theta_{mJ}\}$, and the
consistency among the partitions in different environments is ensured by using the same $\{\mathcal{I}_j\}$.

The partition $\{\Theta_{mj}\}$ should be constructed to facilitate local information sharing among the
environments. For example, one may let $\Theta_{m1} = \mu_m$, $\Theta_{m2} = \pi_m(z = 1, :)$, $\dots$, $\Theta_{m,|\mathcal{Z}|+1} = \pi_m(z =
|\mathcal{Z}|, :)$, $\Theta_{m,|\mathcal{Z}|+2} = W_m(:, a = 1, o = 1, :)$, $\dots$, $\Theta_{mJ} = W_m(:, a = |\mathcal{A}|, o = |\mathcal{O}|, :)$, where
$J = |\mathcal{A}||\mathcal{O}| + |\mathcal{Z}| + 1$. In this partition, each subset of parameters plays a specific role in the RPR
policy, for example, action selection in a particular belief region. This encourages environments to share
the same subset of RPR parameters when they have similar goal states.
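
The example partition can be written out as explicit index sets. The sketch below assumes the RPR parameters are flattened into one vector in the order `mu`, then `pi` (row-major over `[z, a]`), then `W` (row-major over `[z, a, o, z_next]`); this storage order and the function names are assumptions made for illustration.

```python
def rpr_parameter_count(n_regions, n_actions, n_obs):
    """Number of RPR parameters: |Z| + |A||Z| + |A||O||Z|^2."""
    return n_regions + n_actions * n_regions + n_actions * n_obs * n_regions ** 2


def rpr_partition(n_regions, n_actions, n_obs):
    """Index sets I_1..I_J over the flattened vector [mu, pi, W], with J = |A||O| + |Z| + 1."""
    Z, A, O = n_regions, n_actions, n_obs
    subsets = [list(range(Z))]                       # I_1: mu
    offset = Z
    for z in range(Z):                               # I_2..I_{Z+1}: pi(z, :)
        subsets.append(list(range(offset + z * A, offset + (z + 1) * A)))
    offset += Z * A
    for a in range(A):                               # remaining subsets: W(:, a, o, :)
        for o in range(O):
            subsets.append([offset + ((z * A + a) * O + o) * Z + z_next
                            for z in range(Z) for z_next in range(Z)])
    return subsets
```

For |Z| = 2, |A| = 3, |O| = 2 this gives 32 parameters split into J = 9 disjoint subsets, the last six of which each hold the |Z|^2 = 4 transition probabilities for one action-observation pair.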

A.  The  Dependent  Local  Partition  Prior 

With an appropriate partition of RPR parameters, one could place a local DP prior on each subset in
the partition to encourage local information-sharing among the environments,

$$\Theta_{mj} \sim G_j, \quad G_j \sim DP(\alpha G_{0j}), \quad m = 1, \dots, M, \quad j = 1, \dots, J \tag{15}$$

where $G_{0j}$ is the $j$-th marginal of a probability measure $G_0$. Each DP partitions the environments into
clusters, with information shared within each cluster for a particular subset of $\Theta$. This is appealing when
the environments are only partly similar. The drawback, however, is that it ignores the correlation between
subsets of $\Theta$, and is prone to generate an unnecessarily large number of clusters.

Alternatively, one may place a DP prior on all components of $\Theta$ to encourage global information-sharing,

$$\Theta_m \sim G, \quad G \sim DP(\alpha G_0), \quad m = 1, 2, \dots, M \tag{16}$$

A clear drawback is that, under the global DP prior, any two partly similar environments will be forced into
the same cluster or they will be allocated to different clusters; in the former case, the idiosyncratic subsets
of $\Theta$ will be learned inappropriately due to the wrong information-sharing, while useful information is
forfeited for the related subsets in the latter case.

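The dependent local partition process (LPP) [15] is a nonparametric Bayesian prior imposing that when
two environments share a certain subset of RPR parameters, say $\mathcal{I}_j$, they are encouraged to share other
subsets $\{\mathcal{I}_{j'} : j' \neq j\}$. Such a prior promotes correlated local clusters to capture the dependence between
RPR parameters. Formally, we specify the LPP prior on $\{\Theta_m\}$ as follows:

$$
\begin{aligned}
&\Theta_{mj} \sim \eta_j\,\delta_{\tilde\Theta_{mj}} + (1-\eta_j)\,\delta_{\bar\Theta_{mj}}, \quad \eta_j \sim \mathrm{Be}(1,\beta), \quad j = 1, \dots, J \\
&\tilde\Theta_m \sim G, \quad G \sim DP(\alpha G_0), \quad m = 1, \dots, M \\
&\bar\Theta_{mj} \sim G_j, \quad G_j \sim DP(\alpha G_{0j}), \quad m = 1, \dots, M, \quad j = 1, \dots, J
\end{aligned}
\tag{17}
$$

where $\tilde\Theta_{mj}$ is the $j$-th part of $\tilde\Theta_m$, $G_0$ is the base probability measure, and $\alpha, \beta > 0$. Following [15], we denote the LPP prior as
$LPP(\alpha, \beta, G_0)$.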

We have specified the LPP in (17) differently than in [15], to make it easier to discern the structure.
It is clear from (17) that the LPP reduces to the global DP in (16) or the local DPs in (15) when $\beta$
takes extreme values. As $\beta \to 0$, $\Theta_{mj} = \tilde\Theta_{mj}$, with $\tilde\Theta_m$ drawn from $DP(\alpha G_0)$. As $\beta \to \infty$, $\Theta_{mj} = \bar\Theta_{mj}$
is drawn from $DP(\alpha G_{0j})$. Thus, $LPP(\alpha, \beta \to 0, G_0)$ is reduced to $DP(\alpha G_0)$, which is a global DP
imposed on $\{\Theta_m\}$, and $LPP(\alpha, \beta \to \infty, G_0)$ is reduced to $J$ independent DPs, $\{DP(\alpha G_{0j})\}_{j=1}^{J}$, where
$DP(\alpha G_{0j})$ is a local DP independently imposed on $\Theta_{mj}$, with the local DP base $G_{0j}$ being the $j$-th
marginal of the global DP base $G_0$. With $0 < \beta < \infty$, the density of $\Theta_{mj}$ is a mixture of two point
masses, respectively centered at a sample from the global DP and a sample from the $j$-th local DP; thus
the LPP generally combines the global DP and independent local DPs.

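The random probability measures $G$ and $\{G_j\}$ can be explicitly expressed by the stick-breaking
construction of [16],

$$
\begin{aligned}
&G = \sum_{i=1}^{\infty}\lambda_{0i}\,\delta_{\Theta_i^*}, \qquad G_j = \sum_{i=1}^{\infty}\lambda_{ji}\,\delta_{\Theta_{ji}^*}, \quad j = 1, \dots, J \\
&\lambda_{ji} = \lambda_{ji}^*\prod_{l<i}(1-\lambda_{jl}^*), \qquad \lambda_{ji}^* \sim \mathrm{Be}(1,\alpha), \quad j = 0, 1, \dots, J \\
&\Theta_i^* \sim G_0, \qquad \Theta_{ji}^* \sim G_{0j}, \quad i = 1, 2, \dots
\end{aligned}
\tag{18}
$$

which will be used to derive the Gibbs sampler for posterior inference.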
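
The stick-breaking construction and the resulting prior over cluster assignments can be simulated as follows. This is a sketch under explicit assumptions: the infinite sums are truncated at a finite number of atoms and the truncated weights are renormalized before sampling, which only approximates the exact prior; the function names and the 0/1 coding of global versus local assignment are illustrative choices.

```python
import numpy as np


def stick_breaking_weights(alpha, truncation, rng):
    """First `truncation` stick-breaking weights, with stick fractions drawn from Be(1, alpha)."""
    sticks = rng.beta(1.0, alpha, size=truncation)
    # Length of stick remaining before each break: prod_{l < i} (1 - stick_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining


def sample_lpp_assignments(M, J, alpha, beta, truncation, rng):
    """Draw cluster indicators for M tasks and J parameter subsets (truncated approximation).

    Returns s (M x J, 1 = subset j of task m uses the global cluster),
    phi0 (length M, global cluster labels) and phi (M x J, local cluster labels).
    """
    eta = rng.beta(1.0, beta, size=J)                      # mixing proportions, Be(1, beta)
    w_global = stick_breaking_weights(alpha, truncation, rng)
    phi0 = rng.choice(truncation, size=M, p=w_global / w_global.sum())
    s = np.zeros((M, J), dtype=int)
    phi = np.zeros((M, J), dtype=int)
    for j in range(J):
        w_local = stick_breaking_weights(alpha, truncation, rng)
        phi[:, j] = rng.choice(truncation, size=M, p=w_local / w_local.sum())
        s[:, j] = rng.random(M) < eta[j]
    return s, phi0, phi
```

Two tasks share subset `j` when both use the global cluster there and have the same `phi0`, or both use the local cluster and have the same `phi[:, j]`.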


B. Analyzing the LPP Clustering Mechanism

The expressions in (17) and (18) provide insight into the clustering mechanism of the LPP, which we
analyze below. It is seen that the LPP promotes clustering through the discrete random measures $G$ and
$\{G_j\}$, where $G$ is drawn from the global DP and is responsible for global clustering of $\{\Theta_m\}_{m=1}^{M}$, while
$G_j$ is drawn from the $j$-th local DP and is responsible for local clustering of $\{\Theta_{mj}\}_{m=1}^{M}$. The proportions in
the LPP are clearly seen from (17), which shows that a sample $\Theta_{mj}$ drawn from the LPP has two choices:
it enters some global cluster, along with $\{\Theta_{mj'} : j' \neq j\}$, with an average probability of $\frac{1}{1+\beta}$; or it enters
a local cluster, independently of $\{\Theta_{mj'} : j' \neq j\}$, with an average probability of $\frac{\beta}{1+\beta}$. The simultaneous
global and local clustering yields correlated local clusters.

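Analytic expressions have been given in [15] for $p(\Theta_{mj} = \Theta_{m'j})$ and $p(\Theta_{mj} = \Theta_{m'j}, \Theta_{mj'} = \Theta_{m'j'})$,
$\forall\, m \neq m', j \neq j'$, which yield Proposition 10.1.

Proposition 10.1: Let $\Theta_m, \Theta_{m'} \overset{\mathrm{i.i.d.}}{\sim} G$ with $G \sim LPP(\alpha, \beta, G_0)$. Denote $C_1(\alpha,\beta) = p(\Theta_{mj'} = \Theta_{m'j'})$
and $C_2(\alpha,\beta) = p(\Theta_{mj'} = \Theta_{m'j'} \mid \Theta_{mj} = \Theta_{m'j})$. Then

$$\frac{C_2(\alpha,\beta)}{C_1(\alpha,\beta)} = 1 + \frac{4\alpha}{(\beta^2+\beta+2)^2}$$

It is clear that $C_2(\alpha,\beta) > C_1(\alpha,\beta)$, because $\alpha > 0$. Thus knowledge that tasks $m$ and $m'$ are in the
same cluster for $\mathcal{I}_j$ strictly increases the probability that these tasks are in the same cluster for $\mathcal{I}_{j'}$. In this
sense, we say the local cluster for $\mathcal{I}_j$ is positively correlated with the local cluster for $\mathcal{I}_{j'}$.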

The analysis based on Proposition 10.1 considers only two subsets of $\Theta$ and does not reveal the
correlation among $n > 2$ subsets. In what follows, we extend the analysis to $2 < n \le J$ subsets. Our
analysis begins with Lemma 10.2, where we provide an analytic formula for the joint probability that two
tasks are in the same cluster for $n$ distinct subsets of $\Theta$. The lemma is proven in the Appendix.

Lemma 10.2: Let $\{j_1, j_2, \dots, j_n\} \subseteq \{1, 2, \dots, J\}$ have distinct elements. Let $\Theta_m = (\Theta_{m1}, \Theta_{m2}, \dots, \Theta_{mJ})$
and $\Theta_{m'} = (\Theta_{m'1}, \Theta_{m'2}, \dots, \Theta_{m'J})$ be i.i.d. drawn from $G$ with $G \sim LPP(\alpha, \beta, G_0)$, then

$$p\big(\Theta_{mj_1} = \Theta_{m'j_1}, \Theta_{mj_2} = \Theta_{m'j_2}, \dots, \Theta_{mj_n} = \Theta_{m'j_n}\big) = \frac{1}{(1+\alpha)^{n+1}(2+\beta)^n}\left[\left(\frac{2(1+\alpha)}{1+\beta} + \beta\right)^{n} + \alpha\beta^n\right]$$

It is easy to verify that the two formulae in [15] are special cases of the formula in Lemma 10.2,
corresponding to $n = 1$ and $n = 2$, respectively. Lemma 10.2 provides complete information for the
correlations among different subsets of $\Theta$ when considering the clustering of two tasks. Here, we are
particularly interested in the additional change of the probability of $\Theta_{mj_1} = \Theta_{m'j_1}$ when one observes the
new local cluster $\Theta_{mj_{n+1}} = \Theta_{m'j_{n+1}}$, given that one has already observed $n - 1$ previous local clusters
$\{\Theta_{mj_k} = \Theta_{m'j_k} : k = 2, 3, \dots, n\}$ before observing the new one.

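Proposition 10.3: Let $\Theta_m, \Theta_{m'} \overset{\mathrm{i.i.d.}}{\sim} G$ with $G \sim LPP(\alpha, \beta, G_0)$. Then it holds that

$$
\begin{aligned}
C_n(\alpha,\beta) &\overset{\mathrm{def}}{=} p\big(\Theta_{mj_1} = \Theta_{m'j_1} \mid \Theta_{mj_2} = \Theta_{m'j_2}, \dots, \Theta_{mj_n} = \Theta_{m'j_n}\big) \\
&< p\big(\Theta_{mj_1} = \Theta_{m'j_1} \mid \Theta_{mj_2} = \Theta_{m'j_2}, \dots, \Theta_{mj_{n+1}} = \Theta_{m'j_{n+1}}\big) \\
&= C_{n+1}(\alpha,\beta)
\end{aligned}
$$

Proof. Since both sides are positive, we need only prove that the ratio of the right side to the left
side is larger than one. Denote $\zeta = \frac{2(1+\alpha)}{1+\beta} + \beta$. The ratio, using Lemma 10.2, is

$$
\frac{\zeta^{n+1} + \alpha\beta^{n+1}}{\zeta^{n} + \alpha\beta^{n}} \cdot \frac{\zeta^{n-1} + \alpha\beta^{n-1}}{\zeta^{n} + \alpha\beta^{n}}
= \frac{\zeta^{2n} + \alpha^2\beta^{2n} + \zeta^{n+1}\alpha\beta^{n-1} + \zeta^{n-1}\alpha\beta^{n+1}}{\zeta^{2n} + \alpha^2\beta^{2n} + 2\zeta^{n}\alpha\beta^{n}} > 1,
$$

because

$$
\frac{\zeta^{n+1}\alpha\beta^{n-1} + \zeta^{n-1}\alpha\beta^{n+1}}{2\zeta^{n}\alpha\beta^{n}} = \frac{1}{2}\left(\frac{\zeta}{\beta} + \frac{\beta}{\zeta}\right) > 1,
$$

where the last inequality follows from the facts that $\frac{\zeta}{\beta} > 1$ and $x + \frac{1}{x} > 2$ for $x > 1$. Q.E.D.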

It is clear from Proposition 10.3 that $C_2 < C_3 < \cdots < C_n < C_{n+1} < \cdots < C_J$, which shows that the
LPP prior cumulatively increases the probability of $\Theta_{mj_1} = \Theta_{m'j_1}$ when tasks $m$ and $m'$ are observed to
cluster for an increasing number of other subsets of $\Theta$. In other words, each observation, say $\Theta_{mj_k} = \Theta_{m'j_k}$
($k > 1$), increases the probability of $\Theta_{mj_1} = \Theta_{m'j_1}$ on the basis of the increases brought by the previous
observations $\{\Theta_{mj_i} = \Theta_{m'j_i} : i = 2, \dots, k-1\}$. It is noted that Proposition 10.1 has shown that $C_1 < C_2$,
which, along with Proposition 10.3, establishes that $\{C_n : n \ge 1\}$ is a strictly monotonically increasing
sequence. Therefore the two propositions provide a complete picture of the positive correlation between
the local cluster formed on a single subset of $\Theta$ and the local clusters formed on multiple other subsets
of $\Theta$.
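
The closed-form results of this subsection are easy to check numerically. The sketch below assumes the joint matching probability of Lemma 10.2 reads $p_n = [\zeta^n + \alpha\beta^n] / [(1+\alpha)^{n+1}(2+\beta)^n]$ with $\zeta = 2(1+\alpha)/(1+\beta) + \beta$, and computes $C_n = p_n / p_{n-1}$ (with $p_0 = 1$); the function names are illustrative.

```python
import math


def joint_match_prob(n, alpha, beta):
    """Probability that two tasks share the same cluster on n distinct subsets."""
    zeta = 2.0 * (1.0 + alpha) / (1.0 + beta) + beta
    return (zeta ** n + alpha * beta ** n) / ((1.0 + alpha) ** (n + 1) * (2.0 + beta) ** n)


def conditional_match_prob(n, alpha, beta):
    """C_n: probability of a match on one subset given matches on n - 1 other subsets."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return joint_match_prob(n, alpha, beta) / joint_match_prob(n - 1, alpha, beta)


alpha, beta = 1.5, 0.7
c = [conditional_match_prob(n, alpha, beta) for n in range(1, 7)]
ratio_c2_c1 = c[1] / c[0]
```

For any positive `alpha` and `beta`, the list `c` is strictly increasing and `ratio_c2_c1` agrees with $1 + 4\alpha/(\beta^2+\beta+2)^2$, matching the two propositions.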


C.  The  Relevance  to  MTRL 

The  correlation  analysis  above  shows  that  the  LPP  has  a  more  flexible  clustering  structure  than  either 
a  global  DP  as  in  (16)  or  multiple  local  DPs  as  in  (15),  which  allows  the  LPP  to  capture  a  richer  set 
of  similarity  patterns  among  the  tasks  and,  accordingly,  to  make  information  sharing  more  effective.  The 
positive  correlation  is  particularly  appealing  in  our  present  case,  in  which  an  RPR  policy  is  sought  in  each 
environment to accomplish the task of accruing long-term reward. Each subset of RPR parameters assumes
a  particular  responsibility  in  the  task  and  two  different  subsets  of  parameters  may  need  to  coordinate  with 
each  other  to  make  an  overall  functioning  policy. 


Consider,  for  instance,  several  subsets  of  RPR  parameters,  respectively  performing  action  selection 
in  distinct  and  yet  related  belief  regions.  The  actions  in  these  regions  must  coordinate  to  produce  a 
sequence  of  actions  that  lead  to  the  desired  consequence.  If,  indeed,  two  environments  have  some  similar 
belief  regions  and  the  same  consequence  (hence  policy)  is  desired  with  respect  to  these  regions,  then  the 
action  selection  in  these  regions  must  be  shared  across  the  environments.  Independent  local  sharing  by 
independent  DPs  ignores  these  relations  and  leads  to  inefficient  information  usage.  On  the  other  hand, 
complete  global  sharing  by  DP  is  inappropriate  for  partially  similar  environments. 

In  the  MTRL  we  consider  here,  the  LPP  is  imposed  on  the  RPRs,  not  on  the  associated  environments 
(POMDPs). When the prior knowledge is about the environments, how does one transform it into the
knowledge about the RPRs? If we are specifying the knowledge about each individual environment, then
we indeed need to transform it into the knowledge about each associated RPR. However, the prior we
are trying to impose is about the relations between the environments. For the relational prior, we do not
need the transform itself; we only need to require that the transform is continuous, in the sense that, if
two POMDPs (say $m$ and $m'$) are similar for part $i$, then their corresponding RPRs are similar for part $j$.

To  examine  whether  such  a  requirement  is  satisfied  in  our  case,  we  recall  that  each  POMDP  is  a  belief- 
state  MDP  (Markov  decision  process)  and  that  the  corresponding  RPR  is  defined  in  terms  of  the  regions 
in  the  belief-state  space  [13].  The  locality  with  respect  to  (w.r.t.)  the  POMDP  states  corresponds  to  the 
locality w.r.t. the belief-states, which then transfers to the locality w.r.t. the belief-regions in the RPR,
if the belief regions form a Markov partition of the belief-state space [11]. When the last condition is
not  satisfied  exactly,  the  locality  correspondence  between  POMDP  and  RPR  is  approximate.  However, 
this  does  not  affect  our  method  too  much,  given  that  the  RPR  yields  a  stochastic  policy  and  that  the 
information  sharing  here  is  probabilistic  under  a  nonparametric  Bayesian  prior.  The  advantage  of  our 
method  is  still  prominent,  as  demonstrated  by  the  experimental  results. 


D.  Posterior  Inference 

We first introduce some latent variables. For $j = 1, \dots, J$, let $s_{mj} \in \{0, 1\}$, with $s_{mj} = 1$ denoting
$\Theta_{mj} = \tilde\Theta_{mj}$ (the global sample) and $s_{mj} = 0$ denoting $\Theta_{mj} = \bar\Theta_{mj}$ (the local sample). Let $\varphi_{m0} \in \{1, 2, \dots, \infty\}$ index the global cluster that task
$m$ is allocated to. Let $\varphi_{mj} \in \{1, 2, \dots, \infty\}$ index the local cluster that task $m$ is allocated to, concerning
the $j$-th subset of $\Theta$. It is easy to see that $p(s_{mj} = 1) = \eta_j$, and $p(\varphi_{mj} = i) = \lambda_{ji}$ for $j = 0, 1, \dots, J$
and $i = 1, 2, \dots, \infty$.

We are interested in the posterior of $\{\Theta_m\}_{m=1}^{M}$, given the LPP prior specified in (17) and the episodes
$\cup_{m=1}^{M}\mathcal{D}^{(K_m)}$. To allow inference of the LPP parameters, we put a Gamma prior on each, i.e., we assume a
priori that $\alpha \sim \mathrm{Ga}(a_\alpha, b_\alpha)$ and $\beta \sim \mathrm{Ga}(a_\beta, b_\beta)$. We employ a hybrid approach to posterior inference, based
on the stick-breaking construction in (18). Specifically, we employ the slice sampler in [17] to perform
conditional Gibbs sampling of $\{u_{mj}\} \cup \{s_{mj}\} \cup \{\lambda_{ji}^*\} \cup \{\varphi_{mj}\} \cup \{\eta_j\} \cup \{\alpha, \beta\}$ given $\{\Theta_i^*\} \cup \{\Theta_{ji}^*\}$, where
$\{u_{mj}\}$ are auxiliary latent variables conditional on which the infinite mixtures in $G$ and $\{G_j\}$ become
finite. Given the Gibbs samples, we then employ the variational Bayesian (VB) algorithm in [13] to infer
$\{\Theta_i^*\} \cup \{\Theta_{ji}^*\}$. The steps of the hybrid Gibbs-variational approach are summarized as follows.

Step 1. Draw $u_{mj} \sim \mathrm{Unif}(0, \lambda_{j\varphi_{mj}})$, $j = 0, 1, \ldots, J$.

Step 2. Draw $s_{mj} \sim \mathrm{Ber}(p_{mj})$, with

Pm>  n 

where $\hat{V}(\cdot)$ is the empirical value function as defined in Definition 8.2, $\Theta_{m(s_{mj}=1)}$ is $\Theta_m$ with $\Theta_{mj}$ set to the
atom of global cluster $\varphi_{m0}$, and $\Theta_{m(s_{mj}=0)}$ is $\Theta_m$ with $\Theta_{mj}$ set to the atom of local cluster $\varphi_{mj}$.

Step 3. Draw $\lambda^*_{ji}$ from the conditional density

M  J 

m  i- ••)«(!  -Ay'-n  Il:(  n 1  -A3  >  w 

m=lj=0  /<Omj 


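Let $\varphi^*_j = \max\{\varphi_{mj} : m = 1, \ldots, M\}$. It is easy to show $p(\lambda^*_{ji} \mid \cdots) = \mathrm{Be}(\lambda^*_{ji} \mid 1, \alpha)$ for $i > \varphi^*_j$ while, for

$i \le \varphi^*_j$, $p(\lambda^*_{ji} \mid \cdots) \propto \mathrm{Be}(\lambda^*_{ji} \mid 1, \alpha)\, I(d^-_{ji} < \lambda^*_{ji} < d^+_{ji})$ with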


d~  =  max 


df  =  1  -  max  ,  ,  , 


f  ^  mj 

{ 


Ymj 

mj 


=0 


•  Y-'mj 


>■} 


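Step 4. Draw $\varphi_{m0}$ and $\varphi_{mj}$ according to

$$p(\varphi_{m0} \mid \cdots) \propto I(\varphi_{m0} \in \Gamma_{m0})\, \hat{V}(\mathcal{D}^{(K_m)}; \Theta_{m(s_{mj}=1)})$$

$$p(\varphi_{mj} \mid \cdots) \propto I(\varphi_{mj} \in \Gamma_{mj})\, \hat{V}(\mathcal{D}^{(K_m)}; \Theta_{m(s_{mj}=0)})$$

where, for $j = 0, 1, \ldots, J$, $\Gamma_{mj} = \{i : \lambda_{ji} > u_{mj}\}$.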

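Step 5. Draw $\rho_j \sim \mathrm{Be}(1 + \sum_m s_{mj},\ \beta + \sum_m (1 - s_{mj}))$.

Step 6. Draw $\beta \sim \mathrm{Ga}(a_\beta + J,\ b_\beta - \sum_{j=1}^{J} \log(1 - \rho_j))$.

Step 7. Draw $\alpha \sim \mathrm{Ga}(a_\alpha + \sum_{j=0}^{J} \varphi^*_j,\ b_\alpha - \sum_{j=0}^{J} \sum_{i=1}^{\varphi^*_j} \log(1 - \lambda^*_{ji}))$, with $\{\varphi^*_j\}$ given in Step 3.

Step 8. For $j = 1, 2, \ldots, J$, $i = 1, 2, \ldots$, do the following. If $\{m : s_{mj} = 1, \varphi_{m0} = i\}$ is nonempty, infer
the corresponding global atom by applying the VB algorithm to $\bigcup_{m:\, s_{mj}=1,\, \varphi_{m0}=i} \mathcal{D}^{(K_m)}$ with the prior $G_{0j}$; if $\{m : s_{mj} = 0, \varphi_{mj} = i\}$
is nonempty, infer the corresponding local atom by applying the VB algorithm to $\bigcup_{m:\, s_{mj}=0,\, \varphi_{mj}=i} \mathcal{D}^{(K_m)}$ with the prior $G_{0j}$.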
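
The two conjugate draws in Steps 5 and 6 can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the function names `draw_rho` and `draw_beta`, the toy indicator matrix `s`, and the seed are all hypothetical, and the Gamma distribution is assumed to be in shape/rate form (which is what makes the Step 6 update conjugate).

```python
import math
import random

def draw_rho(s_col, beta, rng):
    """Step 5 sketch: rho_j ~ Be(1 + sum_m s_mj, beta + sum_m (1 - s_mj))."""
    n_global = sum(s_col)                # tasks using the global cluster for subset j
    n_local = len(s_col) - n_global      # tasks using a local cluster for subset j
    return rng.betavariate(1 + n_global, beta + n_local)

def draw_beta(rhos, a_beta, b_beta, rng):
    """Step 6 sketch: beta ~ Ga(a_beta + J, b_beta - sum_j log(1 - rho_j)), rate form."""
    shape = a_beta + len(rhos)
    rate = b_beta - sum(math.log(1.0 - r) for r in rhos)
    # random.gammavariate expects (shape, scale), so pass scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(0)
# Hypothetical indicators s[m][j] for M = 4 tasks and J = 3 parameter subsets.
s = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
beta = 1.0
rhos = [draw_rho([row[j] for row in s], beta, rng) for j in range(3)]
beta = draw_beta(rhos, a_beta=1.0, b_beta=1.0, rng=rng)
```

Because each `log(1 - rho_j)` is negative, the rate passed to the Gamma draw is always larger than `b_beta`, so the update is well defined for any indicator configuration.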


XI.  Experimental  Results 

We consider the ten maze navigation tasks in [13]. Of the ten environments, the first four, the following
three, and the last three are respectively duplicates of the grid-worlds (a), (b), and (c) in Figure 3. The
grid-worlds (a) and (c) are partly similar for the cells enclosed by dashed lines. We use these ground
truths in analyzing the sharing mechanism later.


Fig. 3. The three distinct grid-world environments considered in [13], with the goal indicated by a basket. The goal states are fully
observable.

We follow the experimental setup used in [13] to replicate the results for comparison. In particular,
we set $|Z| = 6$ for all RPRs and perform off-line learning, assuming the episodes $\{\mathcal{D}^{(K_m)}\}_{m=1}^{M}$ are
collected beforehand, by taking random actions or querying a PBVI expert, each of the two being
selected with probability 0.5. The Gamma hyper-parameters for $\alpha$ and $\beta$ in the LPP are set to
$a_\alpha = b_\alpha = a_\beta = b_\beta = 1$. The same base measure $G_0$ is used for the LPP and all DPs (global and local),
and the form of $G_0$ is given in (14) with all Dirichlet hyper-parameters set to one.

We compare the proposed LPP-based MTRL method to the following methods: STRL, PBRL, DP-based
MTRL, and IDP-based MTRL, where IDP is a set of independent DPs, each associated with a subset
of $\Theta$. The DP is implemented by the LPP with $\beta$ set to $10^{-20}$ and the IDP is implemented by the LPP
with $\beta$ set to $10^{20}$.
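
Why these two extreme settings recover the DP and the IDP can be read off from the prior mean of the global-allocation probability. The short derivation below assumes, as Steps 5 and 6 of the sampler suggest, that this probability (written $\rho_j$ here) has a $\mathrm{Be}(1, \beta)$ prior; the symbol and the prior form are inferred from the sampler rather than quoted from (17).

```latex
\mathbb{E}[\rho_j] = \frac{1}{1+\beta}, \qquad
\beta = 10^{-20} \;\Rightarrow\; \mathbb{E}[\rho_j] \approx 1
  \quad \text{(every subset follows the global cluster: DP)}, \qquad
\beta = 10^{20} \;\Rightarrow\; \mathbb{E}[\rho_j] \approx 0
  \quad \text{(every subset follows its own local cluster: IDP)}.
```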

The results are reported in Figure 4. It is seen that the LPP-based MTRL earns significantly larger
rewards than the DP-based MTRL. The improvements are attributed to the higher goal rates, since the
LPP actually takes a larger number of steps to reach the goal. The improvements are most significant
when the number of episodes is small (< 10 here). The IDP-based MTRL performs worst among
the methods.

The  performance  of  each  method  can  be  explained  by  the  sharing  patterns  it  infers,  which  we  visualize 
using  Hinton  diagrams  [13],  shown  in  Figure  5  for  K  =  3  episodes  (top  row)  and  K  =  120  episodes 
(bottom row). The block (i, j) in each diagram displays the frequency with which tasks i and j are assigned
to the same cluster in the last 1000 iterations of Gibbs sampling. It is seen that DP infers three global
clusters, respectively corresponding to the three grid-worlds in Figure 3. By contrast, the LPP combines
grid-worlds (a) and (c) into a single global cluster, to capture the similar parts enclosed by the dashed
lines shown in Figure 3, with the differences between (a) and (c) distinguished by splitting them into local
clusters.
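
The quantity plotted in each block of such a diagram can be computed directly from stored Gibbs draws. The sketch below is an illustration under assumptions: the function name `coassignment_frequency` and the four toy label vectors are hypothetical, and each draw is taken to be a list holding one cluster label per task.

```python
def coassignment_frequency(samples):
    """Fraction of draws in which tasks i and j carry the same cluster label.

    samples: list of Gibbs draws, each a list with one cluster label per task.
    """
    n_tasks = len(samples[0])
    freq = [[0.0] * n_tasks for _ in range(n_tasks)]
    for labels in samples:
        for i in range(n_tasks):
            for j in range(n_tasks):
                if labels[i] == labels[j]:
                    freq[i][j] += 1.0
    return [[count / len(samples) for count in row] for row in freq]

# Four hypothetical draws over three tasks.
draws = [[0, 0, 1], [0, 0, 0], [2, 2, 1], [1, 1, 0]]
freq = coassignment_frequency(draws)
```

Only label equality within a draw matters, so the result is unaffected by label switching between Gibbs iterations.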

An example is given in the fifth column of Figure 5, where it is seen that the LPP tends to split grid-
worlds (a) and (c) for $W(:, a, o', :)$, which involves the belief-region transitions when walking west leads
to  observing  walls  on  the  south  and  north.  It  is  clear  from  Figure  3  that  such  an  (action,  observation) 
pair  exclusively  leads  towards  the  goal  in  grid-world  (a)  while  it  may  also  move  away  from  the  goal  in 
grid-world  (c).  By  locally  splitting  them,  the  LPP  encourages  respective  appropriate  transitions  in  the  two 
grid-worlds. 

The diagrams show that the IDP-based MTRL has a strong tendency to isolate tasks, which is
detrimental to information sharing and explains its poor performance. For example, the third column of
Figure 5 shows the local sharing patterns involving the belief-region transitions when walking north leads
to seeing the goal. Clearly, this (action, observation) pair leads to large rewards in both grid-worlds (a)
and (c) and hence local sharing between them is helpful here. It is seen that the IDP needs a large number
of episodes (K = 120) to infer this, while the LPP infers this using only three episodes.

Fig. 4. Performance comparison on the ten maze navigation tasks, as a function of the number of episodes per environment
used by the algorithms. Left: Average success rate for the agent to reach the goal within 15 steps. Middle: Average steps taken
by the agent to reach the goal. Right: Discounted cumulative reward with the discount $\gamma = 0.95$.


XII.  Summary  on  Networked  POMDPs 

We have presented a new framework for multi-task reinforcement learning, based on simultaneous global
and local information-sharing imposed by the LPP. We have extended the second-order analysis in [15] to
higher-order analysis involving an arbitrary number of subsets of parameters. Our analysis provides further
insights into the clustering structure under the LPP. Experimental results demonstrate that the new MTRL
method yields significant performance improvements relative to previously published results. Future work
includes extension of the method to online learning and the study of exploitation vs. exploration within
the MTRL framework.

Fig. 5. The between-task sharing patterns, represented by Hinton diagrams, inferred by DP (column 1), IDPs (columns 4
& 6), and LPP (column 2 for the global sharing; columns 3 & 5 for the local sharing), where K is the number of episodes
per environment (K = 3 for the first row and K = 120 for the second row). The local sharing patterns are compared for
$W(:, a, o', :)$, with a = "walk north" and o' = "goal" in columns 3-4, and a = "walk west" and o' = "walls on the north and south"
in columns 5-6. Note the goal states are fully observable.


Appendix 

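Proof of Lemma 10.2: Let $s_{mj} \in \{0, 1\}$, with $s_{mj} = 1$ denoting that $\Theta_{mj}$ comes from the global DP and $s_{mj} = 0$ denoting that it comes from the $j$-th local DP.

Equation (a) follows because $s_{mj}$ is independent of $s_{mj'}$ and $\rho_j$ is independent of $\rho_{j'}$, $\forall\, j' \neq j$. Equation (b) follows from three facts:
$p(\theta = \theta') = \frac{1}{1+\alpha}$, $\forall\, \theta, \theta' \sim G$ with $G \sim \mathrm{DP}(\alpha G_0)$ [18]; $p(\Theta_{mj_k} = \Theta_{m'j_k}) = 0$ whenever one of them is from the
global DP and the other from the $j_k$-th local DP; and $p(\Theta_{mj_1} = \Theta_{m'j_1}, \ldots, \Theta_{mj_n} = \Theta_{m'j_n}) = \frac{1}{1+\alpha}$ when all of them are from the global DP. To reach equation
(c), one calculates the moments of $\rho_{j_k} \sim \mathrm{Be}(1, \beta)$. Q.E.D.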


XIII.  Review  of  Topic  Modeling 

Topic  models  attempt  to  infer  sets  of  words  from  text  data  that  together  form  meaningful  contextual 
and  semantic  relationships.  Finding  these  groups  of  words,  known  as  topics,  allows  effective  clustering, 
searching,  sorting,  and  archiving  of  a  corpus  of  documents.  If  we  assume  the  bag-of-words  structure,  i.e., 
that words are exchangeable and independent, then there are in general two ways to consider a collection
of documents. Factor models such as probabilistic Latent Semantic Indexing (pLSI) [19], Latent Dirichlet
Allocation  (LDA)  [20]  and  Topics  over  Time  (TOT)  [21]  assume  that  each  word  in  a  given  document 
is  drawn  from  a  mixture  model  whose  components  are  topics.  Other  models  assume  that  words  in  a 
sentence  or  even  in  an  overall  document  are  drawn  simultaneously  from  one  topic  [22],  [23].  In  [22], 
the  authors  propose  modeling  topics  of  words  as  a  Markov  chain,  with  successive  sentences  modeled 
as  being  likely  to  share  the  same  topic.  Since  topics  are  hidden,  learning  and  inferring  the  model  are 
done  using  tools  from  hidden  Markov  models.  Whether  one  draws  a  topic  for  every  word  or  considers 
all  words  within  a  sentence/document  as  being  generated  by  a  common  topic,  documents  are  represented 
as  counts  over  the  dictionary,  and  topics  are  represented  as  multinomial  distributions  over  the  dictionary. 
This  approach  to  topic  representation  is  convenient,  as  the  Dirichlet  distribution  is  the  conjugate  prior  to 
the  multinomial.  However,  because  the  distribution  over  the  dictionary  must  be  normalized,  problems  can 
occur  if  a  previously  unknown  word  is  encountered,  as  can  often  happen  when  using  a  trained  model  on 
an  unknown  testing  set. 
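
The count-vector representation, and the unknown-word problem it creates, can be made concrete with a small sketch. The function name `to_count_vector` and the toy dictionary are hypothetical; the point is only that a word outside the fixed dictionary has no slot in the vector, and hence no probability under a multinomial topic defined over that dictionary.

```python
from collections import Counter

def to_count_vector(tokens, dictionary):
    """Map a tokenized document to counts over a fixed dictionary.

    Returns the count vector and the sorted list of words that the
    dictionary does not contain.
    """
    known = set(dictionary)
    counts = Counter(tokens)
    vector = [counts[word] for word in dictionary]
    unknown = sorted(word for word in counts if word not in known)
    return vector, unknown

vector, unknown = to_count_vector(["a", "b", "a", "zzz"], ["a", "b", "c"])
```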

A new factor model has been proposed [24] that represents each integer word count from the term-
document matrix as a sample from an independent Poisson distribution. This model, called GaP for
gamma-Poisson, factorizes the sparse term-document matrix into the product of an expected-counts matrix
and a theme probability matrix. Note that the GaP model is equivalent to placing a multinomial-Dirichlet
implementation over the dictionary, so that one can model both the relative word frequencies and the
overall word count. One may use the Poisson-gamma characterization as a starting point for building a
dynamic topic model, using an approach closely related to [25]. Using an independent distribution for
each  word  is  attractive,  as  it  addresses  the  problem  of  adding  unknown  words  to  the  dictionary.  Further, 
since  each  word  is  allowed  to  evolve  independently,  this  approach  leads  to  a  more  flexible  model  than 
using  a  traditional  multinomial-Dirichlet  structure.  We  build  upon  this  construct  in  the  model  presented 
here. 

The main focus of this component of the project is on development of a hierarchical Bayesian model
for characterizing documents with known time stamps. Each document is assumed to have an associated
topic, and all documents at a given time are assumed to have topics that are drawn from a mixture model;
the mixture weights in this model evolve with time. This framework imposes the idea that documents
that appear at similar times are likely to be drawn from similar mixtures of topics. To achieve this goal,
we develop a simplified form of the dynamic hierarchical Dirichlet process (dHDP) [26]. Inference is
performed efficiently via a variational Bayesian analysis [8].

Our model differs from other time-evolving topic models [27], [28] in that our topics do not evolve
over time; what changes in time are the mixing weights over topics, while the overall set of topics is
kept unchanged. Specific topics are typically localized over a period of time, with new dominant topics
spawned after other topics diminish in importance (the temporally localized topics may alternatively be
viewed as a time evolution of a single topic [28], but such single-topic evolution is not considered here).

XIV.  Review  of  Semi-Parametric  Statistical  Modeling 

The Dirichlet process (DP) is a semi-parametric measure for development of general mixture models
(in principle, in terms of an infinite number of mixture components). Let $H$ be a measure and $\alpha$
a non-negative real number. A draw from a Dirichlet process parameterized by $\alpha$ and $H$ is denoted
$G \sim \mathrm{DP}(\alpha, H)$. Sethuraman [29] introduced the stick-breaking representation of a DP draw:

$$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta^*_k},$$

$$\pi_k = V_k \prod_{l=1}^{k-1} (1 - V_l), \quad V_k \overset{iid}{\sim} \mathrm{Beta}(1, \alpha), \quad \theta^*_k \overset{iid}{\sim} H, \qquad (19)$$

where $\delta_{\theta^*_k}$ is a point measure concentrated at $\theta^*_k$ (each $\theta^*_k$ is termed an atom), and $\mathrm{Beta}(1, \alpha)$ is a beta
distribution with shape parameters 1 and $\alpha$. Note that $G$ is almost surely discrete, with this playing a key role in
the utility of DP for clustering. To simplify notation below, an infinite probability vector $\pi$ constructed
as above is denoted $\pi \sim \mathrm{Stick}(\alpha)$.
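
The stick-breaking weights in (19) are straightforward to simulate once the infinite sequence is truncated. The sketch below is illustrative only: the function name `stick_breaking`, the truncation level, and the seed are assumptions, and the unassigned stick mass is returned separately rather than folded into the last weight.

```python
import random

def stick_breaking(alpha, n_sticks, rng):
    """Draw the first n_sticks weights of pi ~ Stick(alpha).

    Returns the weights and the stick mass left unassigned by the truncation.
    """
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha)
        weights.append(v * remaining)     # pi_k = V_k * prod_{l<k} (1 - V_l)
        remaining *= 1.0 - v
    return weights, remaining

weights, remaining = stick_breaking(alpha=2.0, n_sticks=20, rng=random.Random(0))
```

A smaller `alpha` concentrates the mass on the first few sticks, which is the behavior that makes the DP useful for clustering.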

Suppose the data of interest are divided into different sets, and each data set is termed a “task” for
analysis. For clustering of T tasks the DP imposes the belief that when two tasks are associated with the
same cluster, all data within the tasks are shared. This may be too restrictive in some applications and
has motivated the hierarchical Dirichlet process (HDP) [30]. We denote the data in task $t$ as $\{x_{t,i}\}_{i=1}^{N_t}$,
where $N_t$ is the number of data in the task. The HDP may be represented as

$$x_{t,i} \sim f(\theta_{t,i}); \quad i = 1, 2, \ldots, N_t, \quad t = 1, 2, \ldots, T,$$

$$\theta_{t,i} \sim G_t; \quad i = 1, 2, \ldots, N_t, \quad t = 1, 2, \ldots, T,$$

$$G_t \sim \mathrm{DP}(\alpha, G); \quad t = 1, 2, \ldots, T,$$

$$G \sim \mathrm{DP}(\gamma, H), \qquad (20)$$

where $f(\theta)$ represents the specific parametric model under consideration. Because the task-dependent DPs
share the same (discrete) base $G$, all $\{G_t\}_{t=1}^{T}$ share the same set of mixture atoms, with different mixture
weights. The measures $\{G_t\}_{t=1}^{T}$ are jointly drawn from an HDP:

$$\{G_1, \ldots, G_T\} \sim \mathrm{HDP}(\alpha, \gamma, H). \qquad (21)$$

The HDP assumes the T tasks are exchangeable; however, there are many applications for which it is
desirable to remove this exchangeability assumption. Models such as the kernel stick breaking process
[31],  [32],  the  generalized  product  partition  model  [33],  the  correlated  topic  model  [34]  and  the  dynamic 
DP  [35]  are  techniques  that  impose  structure  on  the  dependence  of  the  tasks  (removing  exchangeability). 
Some  of  these  models  rely  on  modifying  the  mixing  weights  to  impose  dependence  on  location  [31],  [32] 
or  covariate  [33],  while  others  impose  sequential  time  dependence  on  the  structure  of  consecutive  tasks 
(see  [35]). 

We again consider T tasks, but now the index $t$ explicitly denotes the sequential time of data
production/collection. To address the sequential nature of the time blocks, [26] imposes a dynamic HDP (dHDP)

$$G_t = w_t D_t + (1 - w_t) G_{t-1}; \quad t = 2, \ldots, T, \qquad (22)$$

where $\{G_1, D_2, \ldots, D_T\} \sim \mathrm{HDP}(\alpha, \gamma, H)$. The parameter $w_t \in [0, 1]$ is drawn from a beta distribution
$\mathrm{Beta}(a_0, b_0)$, and it controls the degree of innovation in $G_t$ relative to $G_{t-1}$. The DP and HDP are limiting
cases of this model:

• when $w_t \to 0$, $G_t \to G_{t-1}$ and there is no innovation, resulting in a common set of mixture weights
for all time blocks (DP);

• when $w_t \to 1$, $G_t \to D_t$, where the new innovation distribution $D_t$ controls the sharing mechanism,
resulting in each time block having a unique set of mixing weights (HDP).
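
Because $G_{t-1}$ and $D_t$ place their mass on the same atoms, the recursion (22) acts only on the vectors of mixing weights. A minimal sketch of that update follows; the function name `evolve_weights` and the example weight vectors are hypothetical.

```python
def evolve_weights(prev, innovation, w):
    """Mixing weights of G_t = w * D_t + (1 - w) * G_{t-1} over shared atoms."""
    return [w * d + (1.0 - w) * g for d, g in zip(innovation, prev)]

prev = [0.7, 0.2, 0.1]        # weights of G_{t-1}
innovation = [0.1, 0.1, 0.8]  # weights of D_t
no_innovation = evolve_weights(prev, innovation, 0.0)
full_innovation = evolve_weights(prev, innovation, 1.0)
halfway = evolve_weights(prev, innovation, 0.5)
```

The first two calls reproduce the two limiting cases listed above, and any intermediate `w` returns a convex combination that still sums to one.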

It is important to restate that dHDP does not assume the mixture components evolve over time, only the
mixing weights. The mixture components are shared explicitly across all time blocks. This is fundamentally
different from other models that impose temporal dependence through component evolution [27], [28],
an approach that allows a unique and independent set of mixing weights for each block.

XV.  Semi-Parametric  Dynamic  Topic  Model 

A.  Model  construction 

Consider a collection of documents with known time stamps, with time evolving over $t = 1, \ldots, T$.
At any particular time we have $N_t$ such independent documents. The total set of documents over all time
is represented as $\{x_{t,i}\}_{t=1,\ldots,T;\; i=1,\ldots,N_t}$, where $x_{t,i}$ represents a vector of word counts associated with document
$i$ at time $t$. In the form of the model presented here, we are only interested in the number of times
a given word is present in a particular document; the set of $J$ unique words in the collection forms a
dictionary. Each document is assumed characterized by a single topic, and at time $t$ the topics across all
documents are assumed drawn from a mixture model. In the proposed model the mixture weights on the
topics are assumed to evolve with time (analogous to the dHDP [26] discussed above).
The assumption that each document is characterized by a single topic may seem restrictive; however, we
observe in Section XVIII that for our motivating example this assumption is reasonable.


Fig. 6. Dynamic Dirichlet topic model (dDTM): (a) graphical representation of the model; (b) expanded representation of the product
measure aspect of the model.


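To constitute a model with a time-evolving mixture of topics, we seek a simplified representation of
the dHDP. Specifically, the proposed topic model, termed dDTM for dynamic Dirichlet topic model, is
represented as

$$x_{t,i} \sim F(\theta_{z_{t,i}}), \quad z_{t,i} \sim \mathrm{Mult}(\tau_t), \quad \tau_t = (1 - w_t)\,\tau_{t-1} + w_t\,\pi_t,$$

$$\pi_t \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K), \quad w_t \sim \mathrm{Beta}(c_0, d_0), \quad \theta_k \sim H, \quad H = \prod_{j=1}^{J} H_j. \qquad (23)$$

Note that $\tau_1 = \pi_1$. The factorized structure $H = \prod_{j=1}^{J} H_j$ is similar to [24], which allows insertion of
new words with time.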

Although  perhaps  not  apparent  at  this  point,  for  large  K  the  proposed  model  is  closely  related  to  dHDP; 
this  is  analyzed  in  detail  below.  The  model  is  represented  graphically  in  Fig.  6(a),  and  in  Fig.  6(b)  we 
illustrate  how  a  single  mixture  component  is  drawn,  with  the  parametric  model  of  each  dimension  drawn 
independently  from  its  respective  prior. 
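
A generative sketch may help fix ideas. It is assembled from pieces stated in different places in this report (Dirichlet weights with parameter $\alpha/K$, beta-distributed innovation weights, $\tau_1 = \pi_1$, a single topic per document, and an independent multinomial-Dirichlet component per word) and should be read as an illustration under those assumptions, not as the authors' code. All names (`sample_ddtm`, `dirichlet`, `categorical`) are hypothetical, as is the choice to code each word count as one of M discrete outcomes.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def categorical(probs, rng):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_ddtm(T, N_t, K, J, M, alpha, beta, c0, d0, rng):
    # theta[k][j]: distribution over the M count outcomes of word j in topic k,
    # each drawn independently (the product-measure structure).
    theta = [[dirichlet([beta / M] * M, rng) for _ in range(J)] for _ in range(K)]
    tau, docs, topics = None, [], []
    for t in range(T):
        pi_t = dirichlet([alpha / K] * K, rng)
        if t == 0:
            tau = pi_t                         # tau_1 = pi_1
        else:
            w_t = rng.betavariate(c0, d0)      # innovation weight
            tau = [(1.0 - w_t) * a + w_t * b for a, b in zip(tau, pi_t)]
        for _ in range(N_t):
            z = categorical(tau, rng)          # one topic per document
            x = [categorical(theta[z][j], rng) for j in range(J)]
            topics.append(z)
            docs.append(x)
    return docs, topics

docs, topics = sample_ddtm(T=3, N_t=2, K=5, J=4, M=3, alpha=1.0, beta=1.0,
                           c0=1.0, d0=1.0, rng=random.Random(1))
```

Small values of `w_t` keep consecutive `tau` vectors close, which is how documents at nearby times come to share topics.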

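The form of the parametric model $F(\cdot)$ in (23) may vary depending on the application; in the work
presented here it corresponds to a multinomial-Dirichlet model. We consider the number of times a word
is present in a given document; to do this, $F(\cdot)$ is defined as a multinomial distribution and consequently,
to preserve conjugacy, each $H_j$ is a Dirichlet distribution.

B. Relationship to dHDP

We now make explicit the relationship between dHDP [26] and dDTM represented in (23). Recall that
the draws $\{G_1, D_2, \ldots, D_T\} \sim \mathrm{HDP}(\alpha, \gamma, H)$ may be constructed as [30]

$$G_1 = \sum_{k=1}^{\infty} \pi_{1,k}\, \delta_{\theta^*_k},$$

$$D_t = \sum_{k=1}^{\infty} \pi_{t,k}\, \delta_{\theta^*_k}, \quad t = 2, \ldots, T,$$

$$\pi_t \sim \mathrm{DP}(\alpha, v),$$

$$v \sim \mathrm{Stick}(\gamma), \qquad (24)$$

$$\theta^*_k \sim H, \quad k = 1, \ldots, \infty.$$

The draw $\pi_t \sim \mathrm{DP}(\alpha, v)$ may be represented in stick-breaking form, with the $k$th component of $\pi_t$
constructed as $\pi_{t,k} = \sum_{j=1}^{\infty} w_{t,j}\, \delta(Y_{t,j} = k)$, with $w_t \sim \mathrm{Stick}(\alpha)$, $Y_{t,j} \sim \mathrm{Mult}(v)$; $\delta(Y_{t,j} = k)$ equals one
if $Y_{t,j} = k$ and is zero otherwise. We may also truncate the draw $v \sim \mathrm{Stick}(\gamma)$ to $K$ sticks (denoted
$v_K \sim \mathrm{Stick}_K(\gamma)$), for large $K$ [36]. Using these representations, the overall HDP construction, when
truncated to $K$ topics (atoms), may be represented as

$$G_1 = \sum_{k=1}^{K} \pi_{1,k}\, \delta_{\theta^*_k},$$

$$D_t = \sum_{k=1}^{K} \pi_{t,k}\, \delta_{\theta^*_k}, \quad t = 2, \ldots, T,$$

$$\pi_{t,k} = \sum_{j=1}^{\infty} w_{t,j}\, \delta(Y_{t,j} = k), \quad k = 1, \ldots, K, \quad t = 1, \ldots, T, \qquad (25)$$

$$w_t \sim \mathrm{Stick}(\alpha), \quad Y_{t,j} \sim \mathrm{Mult}(v_K), \quad v_K \sim \mathrm{Stick}_K(\gamma), \quad \theta^*_k \sim H, \quad k = 1, \ldots, K.$$

Note that we truncate $\mathrm{Stick}(\gamma)$ to $K$ sticks, but do not truncate $\mathrm{Stick}(\alpha)$. Additionally, $Y_{t,j} \in \{1, \ldots, K\}$,
with the particular value of $Y_{t,j}$ depending on which component is selected from the multinomial.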

To appreciate the relationship between dHDP and the proposed dDTM, note that (23) corresponds to
drawing atoms/topics at time $t$ from the finite mixture model $G_t = w_t D_t + (1 - w_t) G_{t-1}$, with

$$G_1 = \sum_{k=1}^{K} \pi_{1,k}\, \delta_{\theta_k},$$

$$D_t = \sum_{k=1}^{K} \pi_{t,k}\, \delta_{\theta_k}, \quad t = 2, \ldots, T,$$

$$\pi_t \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K), \qquad (26)$$

$$\theta_k \sim H, \quad k = 1, \ldots, K.$$

Recall that Sethuraman demonstrated [29] that a draw $\pi \sim \mathrm{Dir}(\alpha g_0)$, where $g_0$ is a $K$-dimensional
probability vector and $\alpha > 0$, may be constructed as

$$\pi_k = \sum_{j=1}^{\infty} w_j\, \delta(Y_j = k), \quad k = 1, \ldots, K,$$

$$w \sim \mathrm{Stick}(\alpha), \qquad (27)$$

$$Y_j \sim \mathrm{Mult}(g_0), \quad j = 1, \ldots, \infty,$$

with $\pi_k$ representing the $k$th component of $\pi$. Using Sethuraman's stick-breaking representation of the
Dirichlet distribution in (26), the proposed dDTM is constructed as

$$G_1 = \sum_{k=1}^{K} \pi_{1,k}\, \delta_{\theta_k},$$

$$D_t = \sum_{k=1}^{K} \pi_{t,k}\, \delta_{\theta_k}, \quad t = 2, \ldots, T,$$

$$\pi_{t,k} = \sum_{j=1}^{\infty} w_{t,j}\, \delta(Y_{t,j} = k), \quad k = 1, \ldots, K, \quad t = 1, \ldots, T, \qquad (28)$$

$$w_t \sim \mathrm{Stick}(\alpha), \quad t = 1, \ldots, T,$$

$$Y_{t,j} \sim \mathrm{Mult}(1/K, \ldots, 1/K), \quad j = 1, \ldots, \infty, \quad t = 1, \ldots, T,$$

$$\theta_k \sim H, \quad k = 1, \ldots, K.$$


The truncated dHDP model in (22) draws $\{G_1, D_2, \ldots, D_T\}$ from (25), assuming $\mathrm{Stick}(\gamma)$ is truncated
to $K$ sticks [36]. By contrast, within dDTM the measures $\{G_1, D_2, \ldots, D_T\}$ are drawn from (28). In the
former the random variables $Y_{t,j}$ are drawn from $v_K$, which is in turn drawn from the truncated
stick-breaking process $\mathrm{Stick}_K(\gamma)$; in the latter we simply set $v_K = (1/K, \ldots, 1/K)$ and remove the parameter
$\gamma$ altogether. We expect that this relatively small change does not significantly affect the expressiveness of
the proposed prior. Within the proposed model the weights $w_t$ explicitly impose temporal relationships
between the topics (documents at proximate times are more likely to share the same topics).
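
Sethuraman's construction of a Dirichlet draw can be sketched by truncating the infinite sum in (27). The code below is an approximation under stated assumptions: the function name `sethuraman_dirichlet`, the truncation at `n_sticks`, and the seed are all hypothetical, and the result is an exact Dirichlet draw only in the limit of infinitely many sticks.

```python
import random

def sethuraman_dirichlet(alpha, g0, n_sticks, rng):
    """Approximate pi ~ Dir(alpha * g0) via a truncated stick-breaking sum.

    Each stick weight is added to the component chosen by Y_j ~ Mult(g0).
    Returns pi and the stick mass left over by the truncation.
    """
    pi = [0.0] * len(g0)
    remaining = 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, alpha)
        y = rng.choices(range(len(g0)), weights=g0, k=1)[0]
        pi[y] += v * remaining
        remaining *= 1.0 - v
    return pi, remaining

K = 5
pi, remaining = sethuraman_dirichlet(alpha=1.0, g0=[1.0 / K] * K,
                                     n_sticks=200, rng=random.Random(0))
```

With a uniform `g0` this corresponds to the dDTM choice $v_K = (1/K, \ldots, 1/K)$; replacing `g0` with a draw from a truncated stick-breaking process would give the dHDP counterpart.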

The above discussion also demonstrates that considering the Dirichlet distribution $\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)$
with large $K$ is analogous to (but distinct from) a truncated stick-breaking representation. In this sense, the
proposed model is non-parametric, in that setting a large $K$ allows the model to infer the proper number
of topics from the data, analogous to studies of the truncated stick-breaking representation [36]. Setting a
large $K$ (e.g., $K = 50$ in the examples below) does not imply that we believe that there are actually $K$
topics, since from (27) only a relatively small set of components in $\pi_t$ will have appreciable amplitude
(the same type of motivation for the stick-breaking view of DP and HDP). As in other non-parametric
methods, the proposed model infers a distribution on the proper number of topics, based on the data.

We also emphasize that the stick-breaking representation of a draw from a Dirichlet distribution has
been introduced above to make the connection between the proposed model and a truncated representation
of dHDP. When actually performing inference, it is often simpler to draw directly from
$\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)$; this issue is revisited in the Conclusions.

C.  Limiting  cases 

In Section XIV we considered dHDP under limiting cases of $w_t$, and we do so here for the proposed
dDTM in (23). In the limit $w_t \to 0$, the dDTM parameters are drawn at all times from the same measure
$G_1 = \sum_{k=1}^{K} \pi_{1,k}\, \delta_{\theta_k}$ with $\pi_1 \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K)$ and $\theta_k \sim H$. Therefore, in the limit $K \to \infty$ and
$w_t \to 0$ the topic-model parameters for dDTM are drawn from $\mathrm{DP}(\alpha, H)$, as is the case for dHDP when
$w_t \to 0$. Since $K$ is finite in dDTM, the limit $w_t \to 0$ yields a model similar to LDA [20] (in LDA one
performs a point estimate for $\alpha$, while here $\alpha$ is set).

In the limit $w_t \to 1$, at time $t$ the dDTM model parameters are drawn from $G_t = \sum_{k=1}^{K} \pi_{t,k}\, \delta_{\theta_k}$,
again with $\theta_k \sim H$, and with each $\pi_t \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K)$. Thus the $\{G_t\}_{t=1}^{T}$ all share the same
atoms (topics), with distinct $t$-dependent probability weights $\pi_t$. The dHDP model has a similar limit
when $w_t \to 1$, with the weights drawn $\pi_t \sim \mathrm{DP}(\alpha, v)$ for $v \sim \mathrm{Stick}(\gamma)$. In both cases the atoms/topics
are shared across all time, with different mixture weights. The dHDP arguably allows for more modeling
flexibility, through the parameter $\gamma$, while dDTM yields a simpler model with very similar structural form.


XVI.  Model  Properties 

To examine properties of the model in (23), we consider the discrete indicator space $I = \{1, 2, \ldots, K\}$,
with $k \in I$ indicating one of the $K$ mixing components of the model. Therefore, we can write

$$\tau_t(I) \mid \tau_{t-1}, w_t = (1 - w_t)\,\tau_{t-1}(I) + w_t\,\pi_t(I) = \tau_{t-1}(I) + \Delta_t(I), \qquad (29)$$

where $\Delta_t(I) = w_t(\pi_t(I) - \tau_{t-1}(I))$ is the random deviation from $\tau_{t-1}(I)$ to $\tau_t(I)$.

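Theorem 1. The mean and the variance of the random deviation $\Delta_t$ are controlled by the innovation
weight $w_t$ and the model parameter $a = [\alpha/K, \ldots, \alpha/K]$:

$$E\{\Delta_t(I) \mid \tau_{t-1}, w_t, a\} = w_t\big(E(\pi_t(I)) - \tau_{t-1}(I)\big) = w_t\big([1/K, \ldots, 1/K] - \tau_{t-1}(I)\big), \qquad (30)$$

$$V\{\Delta_t(I) \mid \tau_{t-1}, w_t, a\} = w_t^2\, V\{\pi_t(I)\} = w_t^2\, \frac{1}{K}\Big(1 - \frac{1}{K}\Big)\frac{1}{\alpha + 1}\,[1, \ldots, 1], \qquad (31)$$

where we observe two limiting cases:

• when $w_t \to 0$, $\tau_t = \tau_{t-1}$;

• when $\tau_{t-1} = [1/K, \ldots, 1/K]$, $E\{\tau_t(I) \mid \tau_{t-1}, w_t\} = \tau_{t-1}(I)$.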

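Theorem 2. The correlation coefficient between two adjacent distributions $\tau_{t-1}$ and $\tau_t$, for $t = 2, \ldots, T$, is

$$\mathrm{Corr}(\tau_{t-1,k}, \tau_{t,k}) = \frac{E\{\tau_{t-1,k}\,\tau_{t,k}\} - E\{\tau_{t-1,k}\}\,E\{\tau_{t,k}\}}{\sqrt{V\{\tau_{t-1,k}\}\,V\{\tau_{t,k}\}}} = (1 - w_t)\sqrt{\frac{\sum_{l=1}^{t-1} w_l^2 \prod_{q=l+1}^{t-1} (1 - w_q)^2}{\sum_{l=1}^{t} w_l^2 \prod_{q=l+1}^{t} (1 - w_q)^2}}, \qquad (32)$$

with $w_1 = 1$ (since $\tau_1 = \pi_1$), for any $k \in I$. The proofs of these theorems are provided in Appendix A.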

To compare the similarity of two adjacent tasks/documents, the two theorems yield insights through the
mean and variance of the random deviation and through the correlation coefficient, which can be estimated from
(32) using the posterior expectation of $w$. Although dDTM represents a simplification of the dHDP
framework [26], the sharing properties are similar.
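
The dependence of the adjacent-time correlation on the innovation weights can be checked numerically. The sketch below does not transcribe (32); it derives the correlation from the recursion (29) under two labeled assumptions: the $\pi_t$ are independent across time with a common per-component variance, and the weights are treated as fixed, with the first weight set to one because $\tau_1 = \pi_1$. The function names are hypothetical.

```python
def variance_factor(w):
    """Var(tau_{t,k}) / Var(pi_k) for t = len(w), assuming w[0] == 1.

    Follows from tau_t = (1 - w_t) * tau_{t-1} + w_t * pi_t with the
    pi_t independent across time.
    """
    factor = 0.0
    for w_t in w:
        factor = (1.0 - w_t) ** 2 * factor + w_t ** 2
    return factor

def adjacent_correlation(w):
    """Corr(tau_{t-1,k}, tau_{t,k}) for t = len(w), conditional on the weights w."""
    prev = variance_factor(w[:-1])
    curr = variance_factor(w)
    return (1.0 - w[-1]) * (prev / curr) ** 0.5

no_innovation = adjacent_correlation([1.0, 0.0])
full_innovation = adjacent_correlation([1.0, 1.0])
halfway = adjacent_correlation([1.0, 0.5])
```

With no innovation the two adjacent distributions coincide and the correlation is one; with full innovation $\tau_t$ is a fresh Dirichlet draw and the correlation is zero.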


XVII. Variational Bayes Inference

To motivate the theory of variational inference, we first recognize that the equality

$$\int_{\Phi} Q(\Phi) \ln \frac{Q(\Phi)}{P(\Phi \mid X)\,P(X)}\, d\Phi = \int_{\Phi} Q(\Phi) \ln \frac{Q(\Phi)}{P(X \mid \Phi)\,P(\Phi)}\, d\Phi \qquad (33)$$

can be rewritten as

$$\ln P(X) = \mathcal{L}(Q) + \mathrm{KL}(Q \| P), \qquad (34)$$

where $\Phi$ represents the model latent parameters $\Phi = \{\{\theta_k\}_{k=1}^{K}, z, d, \{\pi_t\}_{t=1}^{T}, w\}$, $X$ the observed data,
$Q(\Phi)$ some yet to be determined approximating density and

$$\mathcal{L}(Q) = \int_{\Phi} Q(\Phi) \ln \frac{P(X \mid \Phi)\,P(\Phi)}{Q(\Phi)}\, d\Phi, \qquad \mathrm{KL}(Q \| P) = \int_{\Phi} Q(\Phi) \ln \frac{Q(\Phi)}{P(\Phi \mid X)}\, d\Phi. \qquad (35)$$

For inference purposes, instead of drawing $z_{t,i} \sim \mathrm{Mult}(\tau_t)$, we use an extra variable $d_{t,i}$ indicating the
task/document we are drawing the mixing weights $\pi_{d_{t,i}}$ from: for each document $x_{t,i}$ we first
draw the task indicator variable $d_{t,i}$ from a stick-breaking construction and then the corresponding topic
indicator $z_{t,i}$ as follows:

$$z_{t,i} \sim \mathrm{Mult}(\pi_{d_{t,i}}), \quad d_{t,i} \sim \mathrm{Mult}(V),$$

$$V_q = w_q \prod_{j=1}^{q-1} (1 - w_j), \quad w_q \sim \mathrm{Beta}(1, d_0), \qquad (36)$$

where $\mathrm{Beta}(1, d_0)$ corresponds to $\mathrm{Beta}(c_0, d_0)$ in (23), with $c_0 = 1$.

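Therefore, the joint distribution of the indicator variables $d$ and $z$ can be written as follows:

$$P(d_{t,i} = v, z_{t,i} = k \mid \pi_t, \theta_k, x_{t,i}) \propto \Big(\prod_{i=1}^{N_t} p(x_{t,i} \mid \theta_{z_{t,i}=k}, w_t)\Big)\, p(z_{t,i} = k \mid d_{t,i}, \pi_t, \theta_k)\, p(d_{t,i} = v)$$

$$= \Big(\prod_{i=1}^{N_t} \prod_{j=1}^{J} \mathrm{Mult}(x^j_{t,i} \mid \theta^j_{z_{t,i}=k})\Big)\, \pi_{t,k}\, w_v \prod_{l=1}^{v-1} (1 - w_l), \qquad (37)$$

where $N_t$ is the total number of documents in block $t$, and $x^j_{t,i}$ corresponds to word $j$ in $x_{t,i}$.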

Our desire is to best approximate the true posterior $P(\{\theta_k\}_{k=1}^{K}, z, d, \{\pi_t\}_{t=1}^{T}, w \mid X)$ by minimizing
$\mathrm{KL}(Q \| P)$, and this is accomplished by maximizing $\mathcal{L}(Q)$. In doing so, we assume that $Q(\Phi)$ can be
factorized, meaning

$$Q(\Phi) = Q(\{\theta_k\}_{k=1}^{K}, z, d, \{\pi_t\}_{t=1}^{T}, w) = Q(\{\theta_k\}_{k=1}^{K})\,Q(z)\,Q(d)\,Q(\{\pi_t\}_{t=1}^{T})\,Q(w). \qquad (38)$$

A general method for writing inference for conjugate-exponential Bayesian networks, as outlined in [37],
is as follows: for a given node in a graph, write out the posterior as though everything were known, take
the natural logarithm, take the expectation with respect to all unknown parameters and exponentiate the result.
Since it requires computational resources comparable to the expectation-maximization (EM) algorithm,
variational inference is fast relative to Markov chain Monte Carlo (MCMC) methods [26] (based on
empirical studies for this particular application, and depending on what level of convergence MCMC is
run to).
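
The decomposition in (34) can be verified numerically on a toy problem in which the latent variable takes finitely many values, so that the integrals become sums. The sketch below is illustrative only; the function name `elbo_and_kl` and the numbers in the example are hypothetical.

```python
import math

def elbo_and_kl(joint, q):
    """Return ln P(X), the lower bound L(Q), and KL(Q || P) on a finite set.

    joint[i] = P(X, phi_i) for the observed X; q[i] = Q(phi_i).
    """
    evidence = sum(joint)                        # P(X)
    posterior = [p / evidence for p in joint]    # P(phi_i | X)
    elbo = sum(qi * math.log(pi / qi) for qi, pi in zip(q, joint) if qi > 0)
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior) if qi > 0)
    return math.log(evidence), elbo, kl

log_px, elbo, kl = elbo_and_kl(joint=[0.10, 0.05, 0.25], q=[0.5, 0.2, 0.3])
```

Since the KL term is non-negative and $\ln P(X)$ does not depend on $Q$, raising the bound is the same as lowering the divergence, which is why maximizing $\mathcal{L}(Q)$ is the working objective.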

A.  VB-E  step 

For the VB-E step, we calculate the variational expectation with respect to all unknown model
parameters $\theta_k$, $\pi_t$ and $w_t$. The variational equations of the model parameters $\theta_k$, $\pi_t$ and $w_t$ are shown below;
their derivation is summarized in Appendix C. The analysis yields

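$$\tilde{\theta}_k = \exp\Big[ \sum_{j=1}^{J} \sum_{m=1}^{M} \big( \langle p_{k,m} \rangle + \beta/M - 1 \big) \ln(\theta_k^{j,m}) \Big],$$

$$\tilde{\pi}_t = \exp\Big[ \sum_{i=1}^{N_t} \sum_{k=1}^{K} \Big( \langle \ln \pi_{t,k} \rangle + \langle \ln w_t \rangle + \sum_{q=1}^{t-1} \langle \ln(1 - w_q) \rangle \Big) + \sum_{k=1}^{K} (\alpha/K - 1) \langle \ln \pi_{t,k} \rangle \Big],$$

$$\tilde{w}_t = \exp\Big[ \sum_{i=1}^{N_t} \sum_{k=1}^{K} \Big( \langle \ln w_t \rangle + \sum_{q=1}^{t-1} \langle \ln(1 - w_q) \rangle \Big) + (d_0 - 1) \langle \ln(1 - w_t) \rangle \Big],$$

where

$$\langle \ln \pi_{t,k} \rangle = \psi(\langle n_{t,k} \rangle + \alpha) - \psi(N_t + 1),$$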
(4.i)  =  XtMXt.i\Ztri  =  k)’ 


(39) 
$$\langle \ln w_v \rangle = \psi(1 + N_v) - \psi\Big( 1 + d_0 + \sum_{l=v}^{T} N_l \Big),$$

$$\langle \ln(1 - w_l) \rangle = \psi\Big( d_0 + \sum_{m=l+1}^{T} N_m \Big) - \psi\Big( 1 + d_0 + \sum_{m=l}^{T} N_m \Big), \qquad (40)$$

with $\psi(\cdot)$ the digamma function, $\langle n_{t,k} \rangle$ the number of words sharing topic $k$ in block $t$, $\beta = [\beta/M, \ldots, \beta/M]$
the Dirichlet hyper-parameters for the priors on the words distribution, and $m \in \{1, 2, \ldots, M\}$ a possible
outcome of the multinomial distributions on the word counts.
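The digamma terms in these expectations come from a standard identity: if $w \sim \mathrm{Beta}(a, b)$, then $\langle \ln w \rangle = \psi(a) - \psi(a + b)$ and $\langle \ln(1 - w) \rangle = \psi(b) - \psi(a + b)$. The sketch below evaluates that identity in Python. The small `digamma` routine (recurrence plus asymptotic series) is a stand-in written for this example, and the function names and parameter values are hypothetical rather than taken from the report.

```python
import math


def digamma(x):
    """Approximate the digamma function psi(x) for x > 0."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    result = 0.0
    # Recurrence psi(x) = psi(x + 1) - 1/x shifts the argument up to x >= 6.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def beta_log_expectations(a, b):
    """Return (<ln w>, <ln(1 - w)>) for w ~ Beta(a, b)."""
    total = digamma(a + b)
    return digamma(a) - total, digamma(b) - total


# Hypothetical Beta posterior parameters, chosen only for illustration.
e_ln_w, e_ln_1mw = beta_log_expectations(13.0, 32.0)
```

Both expectations are negative, and by Jensen's inequality each lies below the logarithm of the corresponding mean.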


B.  VB-M  step 

Updating  the  variational  posteriors  in  the  VB-M  step  is  performed  by  updating  the  sufficient  statistics 
of  the  model  parameters,  obtained  from  the  VB-E  step.  The  analysis  yields 

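$$Q(\theta_k) = \prod_{j=1}^{J} \mathrm{Dir}\big( \beta/M + \langle p_{k,1} \rangle, \ldots, \beta/M + \langle p_{k,M} \rangle \big),$$

$$Q(\pi_t) = \mathrm{Dir}\big( \alpha/K + \langle n_{t,1} \rangle, \ldots, \alpha/K + \langle n_{t,K} \rangle \big),$$

$$Q(w_t) = \mathrm{Beta}\Big( 1 + \langle m_t \rangle,\; d_0 + \sum_{b=1}^{t-1} \langle m_b \rangle \Big), \qquad (41)$$

where $\langle m_t \rangle = \sum_{k=1}^{K} \langle n_{t,k} \rangle$ and $\langle p_{k,m} \rangle$ is the number of words with outcome $m$ in topic $k$.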
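The sufficient-statistic update for the topic weights can be sketched as follows. This is an illustrative Python example, assuming the expected count for topic $k$ in block $t$ is the sum of the variational responsibilities $Q(z_{t,i} = k)$ over the documents of that block, and that the Dirichlet posterior adds $\alpha/K$ to each count; the function names and the toy responsibilities are hypothetical.

```python
def expected_topic_counts(q_z_block):
    """Sum q(z_{t,i} = k) over the documents i of one block to get <n_{t,k}>."""
    num_topics = len(q_z_block[0])
    return [sum(doc[k] for doc in q_z_block) for k in range(num_topics)]


def dirichlet_posterior(alpha, counts):
    """Parameters alpha/K + <n_{t,k}> of the variational Dirichlet on pi_t."""
    num_topics = len(counts)
    return [alpha / num_topics + n for n in counts]


# Hypothetical responsibilities for three documents and K = 2 topics.
q_z_block = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
n_t = expected_topic_counts(q_z_block)      # expected counts per topic
m_t = sum(n_t)                              # <m_t> = sum over k of <n_{t,k}>
dir_params = dirichlet_posterior(1.0, n_t)  # alpha = 1
```

Because each document's responsibilities sum to one, `m_t` equals the number of documents in the block.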

XVIII.  Experimental  Results 

The  proposed  model  is  demonstrated  on  two  data  sets,  each  corresponding  to  a  sequence  of  documents 
with  known  time  dependence:  (i)  the  NIPS  data  set  [22]  containing  publications  from  the  NIPS  conferences 
between 1987 and 1999 and (ii) every United States presidential State of the Union Address from 1790 to
2008.

As comparisons to the dDTM model developed here, we consider LDA [20] and TOT [21], and dDTM
with innovation weights set as $\{w_t\}_{t=2}^{T} = 1$ (termed DTM). For the dDTM framework, we initialized
the hyper-parameters as follows: $\alpha = 1$, $c_0 = 1$, $d_0 = 2$, and Dirichlet distributions with
uniform parameters $\beta = [\beta/M, \ldots, \beta/M]$ as priors on the words distribution; the integer $M$ defines the number
of possible outcomes concerning the occurrence of a given word in a document, and this is detailed
below for the particular examples. We ran VB until the relative change in the marginal likelihood bound
[38] was less than 0.01%. For the LDA and TOT model initializations, we used exchangeable Dirichlet
distributions as priors on word probabilities and initialized the Dirichlet hyper-parameters for the topic
mixing weights with $\alpha = [\alpha/K, \ldots, \alpha/K]$. The truncation level was set to $K = 50$ topics in all four models.
For the reasons discussed in Section XV, the dDTM results are expected to be insensitive to the setting of $K$,
as long as it is “large enough”; we also performed studies of the below data for $K = 75$ and $K = 100$,
with very similar results.
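The stopping rule on the marginal likelihood bound can be sketched as below. This is an illustrative Python example, assuming the relative change is measured against the magnitude of the previous bound value (0.01% corresponds to a tolerance of 1e-4); the function names and the sequence of bound values are hypothetical.

```python
def has_converged(previous_bound, current_bound, tolerance=1e-4):
    """True when the relative change in the bound is below tolerance (0.01% = 1e-4)."""
    change = abs(current_bound - previous_bound)
    return change < tolerance * abs(previous_bound)


def first_converged_iteration(bounds, tolerance=1e-4):
    """Index of the first iteration at which the stopping rule fires, else None."""
    for iteration in range(1, len(bounds)):
        if has_converged(bounds[iteration - 1], bounds[iteration], tolerance):
            return iteration
    return None


# Hypothetical lower-bound values recorded after successive VB iterations.
bounds = [-1000.0, -900.0, -890.0, -889.95, -889.94]
stop_iteration = first_converged_iteration(bounds)
```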

A.  NIPS  Data  Set 

The NIPS (Neural Information Processing Systems) data set comprises 1,740 publications. The total
number of unique words was $J = 13,049$. The observation vector $x_{t,i}$ corresponds to the frequency
of all words in paper $i$ of the NIPS proceedings from year $t$. We set the total number of outcomes of
the multinomial distributions to $M = 5$: $m = 1$ corresponds to a word occurring zero times, $m = 2$
corresponds to a word occurring once or twice, $m = 3$ corresponds to a word occurring three to
five times, $m = 4$ corresponds to a word occurring six to ten times, and $m = 5$ corresponds to a
word occurring more than ten times in a publication. This decomposition was defined based on examining
a histogram of the rate with which any given word appeared in a given publication (see Fig. 7).

Fig. 7. Histogram of the rate of word appearances in the NIPS data set; the horizontal axis represents the number of times a given word
appears in one document, and the vertical axis quantifies the number of times such words occurred across all documents. For example, in
an average document, there will be 95 words that appear twice. From this we note that most words rarely occur more than five times in a
given document.
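The five-outcome quantization of word counts described above can be written as a small lookup. The following Python sketch maps a raw per-document word count to the outcome index $m$; the function name is hypothetical, and the bin edges follow the ranges stated in the text.

```python
def count_to_outcome(count):
    """Map a word's count in one publication to the outcome m in {1, ..., 5}."""
    if count < 0:
        raise ValueError("counts must be non-negative")
    if count == 0:
        return 1  # word absent
    if count <= 2:
        return 2  # once or twice
    if count <= 5:
        return 3  # three to five times
    if count <= 10:
        return 4  # six to ten times
    return 5      # more than ten times


# Hypothetical word counts for one publication.
outcomes = [count_to_outcome(c) for c in [0, 1, 2, 3, 5, 6, 10, 11]]
```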

We  first  estimated  the  dDTM  posterior  distributions  over  the  entire  set  of  topics;  the  time  evolution 
of  the  posterior  dDTM  probabilities  for  four  representative  topics  and  their  ten  most  probable  words,  as 
computed  via  the  posterior  updates  of  words  distributions  within  topics,  are  shown  in  Fig.  8;  we  ran  the 
algorithm  20  times  (with  different  randomly  selected  initializations)  and  chose  the  VB  realization  with 
the  highest  lower  bound. 

We  then  selected  the  years  when  the  four  topics  represented  above  reached  their  highest  probability 
of  being  drawn  and  identified  associated  publications;  as  we  can  see  in  Table  I,  for  a  given  topic,  there 
is  a  strong  dependency  between  the  most  probable  words  and  associated  publications,  with  this  proving 
to  be  a  useful  method  of  searching  for  papers  based  on  a  topic  name  or  topic  identifying  words.  These 
representative  results  are  interpreted  as  follows:  Topics  A  and  C  appear  to  be  related  to  neural  networks 
and speech processing, which appear to have a diminishing importance with time. By contrast, Topics B
and D appear to be related to more statistical approaches, which have an increasing importance with time.
The  specific  topic  label  is  artificially  given;  it  corresponds  to  one  indicator  variable  in  the  VB  solution. 

In  our  next  experiment,  we  quantitatively  compared  the  dDTM,  LDA,  TOT  and  DTM  models  by 
computing  the  perplexity  of  a  held-out  test  set;  perplexity  [20]  is  a  popular  measure  used  in  language 
modeling,  reflecting  the  difficulty  of  predicting  new  unseen  documents  after  learning  the  model  from  a 
training  data  set.  The  perplexity  results  considered  here  are  not  the  typical  held-out  at  random  type,  but 
real  prediction  where  we  are  using  the  past  to  build  a  model  for  the  future;  a  lower  perplexity  score 
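indicates better model performance. The perplexity for a test set of $N_{test}$ documents is defined to be

$$\mathrm{perplexity} = \exp\Big( -\frac{\sum_{i=1}^{N_{test}} \ln p(x_{test,i})}{\sum_{i=1}^{N_{test}} n_{test,i}} \Big), \qquad (42)$$

where $x_{test,i}$ represents the document $i$ in the test set and $n_{test,i}$ is the number of words in document $i$.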
Fig. 8. Posterior topic probabilities distribution and most probable words for NIPS data set, as computed by the dDTM model.


TABLE I

Representative topics from the NIPS database, with their most probable words and associated publications.

Topic A (year 1989)
Most probable words: training, networks, input, speech, time, recognition, set, state, number, word
Associated publications:
'A Continuous Speech Recognition System Embedding MLP into HMM'
'Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters'
'Speaker Independent Speech Recognition with Neural Networks and Speech Knowledge'
'The Cocktail Party Problem: Speech/Data Signal Separation Comparison between Back propagation and SONN'

Topic B (year 1999)
Most probable words: algorithm, problem, weights, case, linear, weight, data, model, error, training
Associated publications:
'Model Selection for Support Vector Machines'
'Uniqueness of the SVM Solution'
'Differentiating Functions of the Jacobian with Respect to the Weights'
'Transductive Inference for Estimating Values of Functions'

Topic C (year 1988)
Most probable words: spike, visual, cell, response, synaptic, function, firing, activity, output
Associated publications:
'Models of Ocular Dominance Column Formation: Analytical and Computational Results'
'Modeling the Olfactory Bulb: Coupled Nonlinear Oscillators'
'A Model for Resolution Enhancement (Hyperacuity) in Sensory Representation'
'A Computationally Robust Anatomical Model for Retinal Directional Selectivity'

Topic D (year 1999)
Most probable words: distribution, gaussian, algorithm, model, set, space, information, linear, function, number
Associated publications:
'Local Probability Propagation for Factor Analysis'
'Algorithms for Independent Components Analysis and Higher Order Statistics'
'Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology'
'Data Visualization and Feature Selection: New Algorithms for Nongaussian Data'

In  our  experiment  the  role  of  a  document  is  played  by  a  publication;  the  perplexity  results  correspond 
to  a  real  prediction  scenario,  where  we  are  using  the  past  to  build  a  model  for  the  future:  we  held  out  all 
the  publications  from  one  year  for  test  purposes  and  trained  the  models  on  all  the  publications  from  all 
the years prior to the testing year; as testing years we considered the last five years, 1995 to
1999.
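The perplexity score used in this comparison can be sketched in a few lines. This Python example assumes the usual definition of perplexity as the exponentiated negative per-word log-likelihood of the held-out documents; how each model computes the document log-likelihoods is not shown, and the function name and toy values are hypothetical.

```python
import math


def perplexity(log_likelihoods, word_counts):
    """exp(-sum_i ln p(x_i) / sum_i n_i) over the held-out documents."""
    if len(log_likelihoods) != len(word_counts):
        raise ValueError("one log-likelihood and one word count per document")
    return math.exp(-sum(log_likelihoods) / sum(word_counts))


# Sanity check with a hypothetical model that assigns every word probability 1/50:
# two held-out documents with 10 and 20 words.
uniform_log_prob = math.log(1.0 / 50.0)
score = perplexity([10 * uniform_log_prob, 20 * uniform_log_prob], [10, 20])
```

For a model that spreads probability uniformly over 50 words the score is 50, which gives an intuitive reading of perplexity as an effective vocabulary size; lower is better.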

The  perplexity  for  the  LDA  and  TOT  models  was  computed  as  in  [20];  for  the  dDTM  and  DTM  models 
it  was  computed  as  follows: 
where $z$ is the topic indicator, $i$ is the publication index, $d$ is the block/year indicator, $T$ is the total
number of training years, and $d_0$ is the hyper-parameter of the beta prior distributions $\mathrm{Beta}(1, d_0)$ on the
innovation weights $\{w_t\}_{t=2}^{T}$. The perplexity computation for the dDTM model is provided in Appendix
B.

Figure 9 shows the mean value and standard deviation of the perplexity of dDTM, LDA, TOT and
DTM  models  with  K  =  50  topics;  we  ran  20  VB  realizations  for  the  dDTM,  LDA  and  DTM  and  20 
MCMC  realizations  (with  1000  iterations  each)  for  the  TOT  model.  We  see  that  the  dDTM  model  slightly 
outperforms  the  other  models,  with  the  LDA  and  TOT  better  than  the  DTM.  The  improved  performance 
of  dDTM  model  is  due  to  the  time  evolving  structure;  the  order  of  publications  plays  an  important  role 
in  predicting  new  documents,  through  the  innovation  weight  probability  w,  as  can  be  seen  in  (43). 

While the NIPS database is widely used for topic modeling, the relatively small number of years it
entails limits interesting analysis of the ability of dDTM to model the time-evolving properties of
documents. This motivates the next example, which corresponds to a yearly database extending over 200
years.

Fig. 9. Perplexity results on the NIPS data set for dDTM, LDA, TOT and DTM: mean value and standard deviation.


B. State of the Union Data Set

The  State  of  the  Union  data  set  comprised  20,431  paragraphs,  each  with  a  time  stamp  from  1790  to 
2008. The observation vector $x_{t,i}$ corresponds to the frequency of all words in paragraph $i$ of the State of
the  Union  from  year  t.  In  this  (motivating)  example,  “document”  i  for  year  t  corresponds  to  paragraph  i 
from  the  State  of  the  Union  for  year  t.  Therefore,  the  model  assumes  the  State  of  the  Union  is  represented 
by  a  mixture  of  topics,  and  within  dDTM  the  mixture  weights  evolve  with  time. 

After removing common stop words by referencing a common list which can be found at
http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words, and applying the Porter
stemming algorithm [39], the total number of unique words was $J = 747$. In the rare years where two
State of the Union addresses were given, the address given by the outgoing president was used. Similar
to the NIPS data set, each paragraph was represented as a datum, a vector of word counts over the
dictionary. However, to match the data structure, we set the number of possible outcomes as $M = 2$,
indicating whether a given word was present ($m = 1$) or not ($m = 2$) in a given paragraph. This structure
corresponds  to  a  binomial-beta  representation  of  the  words  distribution,  a  special  case  of  the  multinomial- 
Dirichlet  model  used  in  the  NIPS  experiment. 

For  our  first  experiment  we  estimated  the  posterior  distributions  over  the  entire  set  of  topics,  for  each 
of  the  three  models  mentioned  above.  Results  for  the  dDTM  model  are  shown  in  Fig.  10  for  the  time 
evolution of the posterior dDTM probabilities for five important topics in American history: ‘American
civil war’, ‘world peace’, ‘health care’, ‘U.S. Navy’ and ‘income tax’; similar to the NIPS experiment, we
ran  the  algorithm  20  times  (with  random  initialization)  and  chose  the  VB  realization  with  the  highest  lower 
bound.  The  topic  distributions  preserve  sharp  peaks  in  time  indicating  significant  information  content  at 
particular  historical  time  points.  It  is  important  to  mention  that  we  have  (artificially)  named  the  topics 
based  on  their  ten  most  probable  words.  The  corresponding  most  probable  words  are  shown  in  the  right 
hand  side  of  each  plot.  In  comparison,  the  dDTM  seems  to  perform  better  than  LDA,  TOT  and  DTM: 
‘American  civil  war’  and  ‘health  care’  are  topics  that  were  not  found  by  LDA,  TOT  or  DTM.  The  better 
performance  of  the  dDTM  model  can  be  explained  by  the  sharing  properties  that  exist  between  adjacent 
blocks, properties controlled by the innovation weight $w$. Figures 11, 12 and 13 show topic distributions
and  their  associated  ten  most  probable  words  for  the  LDA,  TOT  and  DTM  models,  respectively. 

Concerning the interpretation of these results, we note that the US was not a world power until after
World War II, consistent with Fig. 10(a). National health care in the US became a political issue in the
early and mid 1990s, and continues such to this day. The US Navy was an important defense issue from
the earliest days of the country, particularly in wars with Britain and Spain. With the advent of aircraft,
the importance of the navy diminished, while still remaining important today. Concerning Fig. 10(d) on
taxation, the first federal laws on federal (national) income tax were adopted by Congress in 1861 and
1862, and the Sixteenth Amendment to the US Constitution (1913) also addressed federal taxation. The
heavy importance of this topic around 1920 is attributed to World War I, with this becoming an important
issue/topic thereafter (concerning the appropriate tax rate). The US Civil War, which had a heavy focus
on “state rights”, was of course in the 1860-1865 period, with state rights being a topic of some focus
sporadically thereafter.

Fig. 10. dDTM model - topic probabilities distribution and most probable words for State of the Union data set. (a) World peace, (b) health
care, (c) U.S. Navy, (d) income tax, (e) Civil War.

Fig. 11. LDA model - topic probabilities distribution and most probable words. (a) World peace, (b) U.S. Navy, (c) income tax.

Fig. 12. TOT model - topic probabilities distribution and most probable words. (a) World peace, (b) U.S. Navy, (c) income tax.

Another advantage of dDTM over LDA, TOT and DTM is that it allows us to analyze the dynamic
evolution of topic mixing weights through innovation probabilities. For that, using the dDTM model, we
examined the innovation weight probability $w$ for each year from 1790 to 2008. Table II shows the years
when the mean innovation probability was greater than 0.8, the year-period description and the name of
the associated president. As observed during those years, important political events are well identified
by dDTM. For each of the innovating years shown in Table II we also estimated the ‘most innovative’
words with respect to their previous year. For example, we were interested in finding the words that
caused innovation during year 1829. For that, we first calculated the distribution of the words within one
year, by integrating out the topics; we then estimated the Kullback-Leibler (KL) divergence between the
probabilities of a given word belonging to two consecutive years, 1828 and 1829. The higher the KL
distance is for a given word, the more innovation it produces; the ten most innovative words for each of
the years of interest are shown in Table III.
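The word-level KL ranking can be sketched as follows. This Python example makes two assumptions that the text does not spell out: each word is treated as a Bernoulli variable (present or absent, in line with $M = 2$ for this data set), and the divergence is taken from the current year to the previous year. The function names, words and probabilities are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p and q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def most_innovative_words(prev_probs, curr_probs, top=10):
    """Rank words by the KL divergence of their presence probability between two years."""
    scores = {word: bernoulli_kl(curr_probs[word], prev_probs[word]) for word in curr_probs}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical per-year word presence probabilities, after integrating out the topics.
probs_previous = {"bank": 0.05, "treaty": 0.20, "peace": 0.30}
probs_current = {"bank": 0.40, "treaty": 0.22, "peace": 0.30}
ranking = most_innovative_words(probs_previous, probs_current)
```

A word whose probability is unchanged between the two years scores zero and falls to the bottom of the ranking.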

The  results  in  Table  II  ideally  (if  dDTM  works  properly)  correspond  to  periods  of  significant  change 
in the US. Concerning interpretation of these results, President Jackson was the first non-patrician US
president, and he brought about significant change (e.g., he ended the national banking system in the US).
The  Civil  War,  World  War  I,  World  War  II,  Vietnam  and  the  end  of  the  Cold  War  were  all  significant 
changes  of  “topics”  within  the  US.  Ronald  Reagan  also  brought  a  level  of  conservative  government  to 
the  US  which  was  a  significant  change.  These  key  periods,  as  inferred  automatically  via  dDTM,  seem  to 
be  in  good  agreement  with  historical  events  in  the  US. 

We  also  analyzed  the  ability  of  dDTM  to  group  paragraphs  into  topics.  We  chose  two  distinguishing 
years  in  American  history,  1861  (during  the  American  Civil  War)  and  2002  (post  terrorist  attacks)  and  show 
the  most  probable  three  topics  as  computed  via  the  VB  posterior  updates  and  their  associated  paragraphs 
(see  Tables  IV  and  V).  In  1861  the  three  major  topics  were  ‘political  situation’,  ‘finances’  and  ‘army’. 
Fig. 13. DTM model - topic probabilities distribution and most probable words. (a) World peace, (b) U.S. Navy, (c) income tax.

TABLE II

Years with the mean innovation weight probability greater than 0.8 in the dDTM model, year-period description
and the associated president.

Year  Mean innovation weight probability  Period description                                       President
1829  0.87                                Pres. A. Jackson’s era                                   A. Jackson
1831  0.84                                Pres. A. Jackson’s era                                   A. Jackson
1861  0.82                                Civil War                                                A. Lincoln
1909  0.81                                Industrialization                                        W. H. Taft
1919  0.85                                Post-World War I era                                     W. Wilson
1938  0.84                                Roosevelt’s economic recovery                            F. D. Roosevelt
1939  0.82                                Second World War                                         F. D. Roosevelt
1965  0.8                                 Vietnam War                                              L. B. Johnson
1981  0.81                                R. Reagan’s promised economic revival and the recession  R. Reagan
1982  0.89                                R. Reagan’s promised economic revival and the recession  R. Reagan
1990  0.82                                The end of the Cold War                                  G. H. W. Bush

TABLE  III 

Most  innovative  words  in  the  years  with  the  mean  innovation  weight  probability  greater  than  0.8 
TABLE IV

Paragraph clustering analysis for year 1861: top three most probable topics and their associated paragraphs.

Topic 40

Nations thus tempted to interfere are not always able to resist the counsels of seeming expediency and ungenerous
ambition although measures adopted under such influences seldom fail to be unfortunate and injurious to those
adopting them.

It is not my purpose to review our discussions with foreign states because whatever might be their wishes or
dispositions the integrity of our country and the stability of our Government mainly depend not upon them but on
the loyalty virtue patriotism and intelligence of the people

Some treaties designed chiefly for the interests of commerce and having no grave political importance have been
negotiated and will be submitted to the Senate for their consideration.

Topic 41

The revenue from all sources including loans for the financial year ending on the 30th of June was and the
expenditures for the same period including payments on account of the public debt were leaving a balance in the
Treasury on the 1st of July of

For the first quarter of the financial year ending on the 30th of September the receipts from all sources including
the balance of the 1st of July were and the expenses leaving a balance on the 1st of October of

The revenue from all sources during the fiscal year ending June including the annual permanent appropriation of
for the transportation of free mail matter was being about 12 per cent less than the revenue for

Topic 22

I respectfully refer to the report of the Secretary of War for information respecting the numerical strength of the
Army and for recommendations having in view an increase of its efficiency and the wellbeing of the various
branches of the service intrusted to his care

The large addition to the Regular Army in connection with the defection that has so considerably diminished the
number of its officers gives peculiar importance to his recommendation for increasing the corps of cadets to the
greatest capacity of the Military Academy.

It is gratifying to know that the patriotism of the people has proved equal to the occasion and that the number of
troops tendered greatly exceeds the force which Congress authorized me to call into the field

In 2002 the topics were ‘terrorism’, ‘national budget’ and ‘overall progress of the country’. In
both cases, the algorithm automatically clusters the paragraphs using what appears to be an accurate topic
representation.

To  show  the  dynamic  structure  of  dDTM,  we  selected  2002  as  a  reference  year  and  its  two  years  before 
and  after  as  years  where  topic  transition  could  be  manifested.  For  each  of  the  five  years,  we  estimated  the 
most  probable  topic  and  identified  its  associated  paragraphs.  As  we  can  see  in  Table  VI,  a  topic  transition 
is  manifested  during  this  time  interval:  if  in  2000,  the  main  topic  was  ‘economy’,  in  the  following  years 
attention  is  paid  to  ‘education’,  ‘terrorism’,  ‘economy’  again  and  ‘war  in  Iraq’,  respectively.  The  terrorist 
attacks  on  the  World  Trade  Center  and  on  the  Pentagon  occurred  in  2001,  manifesting  the  clear  change 
in  the  important  “topics”. 

Finally,  we  again  compared  dDTM,  LDA,  TOT  and  DTM  models  by  computing  their  perplexities;  in 
this  case,  the  role  of  a  document  was  represented  by  a  paragraph  and,  similar  to  the  NIPS  experiment,  we 
considered  the  task  of  real  prediction,  by  holding  out  all  the  paragraphs  from  one  year  for  test  purposes 
and  training  the  models  on  all  the  paragraphs  from  all  the  years  prior  to  the  testing  year;  as  testing  years 
we  considered  the  ending  year  of  each  decade  from  1901  to  2000. 
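The hold-out protocol described above can be sketched as follows. This Python example reads "the ending year of each decade from 1901 to 2000" as the years 1910, 1920, ..., 2000, which is an interpretation rather than something the text states explicitly; the function names and the toy corpus are hypothetical.

```python
def decade_end_years(first_year=1901, last_year=2000):
    """Years divisible by 10 in the range, read here as the end of each decade."""
    return [year for year in range(first_year, last_year + 1) if year % 10 == 0]


def temporal_split(docs_by_year, test_year):
    """Train on every year strictly before test_year; test on test_year itself."""
    train = [doc for year in sorted(docs_by_year) if year < test_year
             for doc in docs_by_year[year]]
    test = list(docs_by_year.get(test_year, []))
    return train, test


# Hypothetical toy corpus keyed by year; each entry stands in for one paragraph.
docs_by_year = {1908: ["p1"], 1909: ["p2", "p3"], 1910: ["p4"], 1911: ["p5"]}
train_docs, test_docs = temporal_split(docs_by_year, 1910)
test_years = decade_end_years()
```

Paragraphs from years after the test year are excluded from training, so the score reflects prediction of the future from the past rather than a random hold-out.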

Figure 14 shows the mean perplexity of dDTM, LDA, TOT and DTM models with $K = 50$ topics and
10 testing years. We ran 20 VB realizations for the dDTM, LDA and DTM and 20 MCMC realizations
(with 1000 iterations each) for the TOT model; the standard deviation values are included as well. We
see  that  the  dDTM  model  consistently  performs  better  than  the  other  models.  We  also  observe  that  LDA 


TABLE V

Paragraph clustering analysis for year 2002: top three most probable topics and their associated paragraphs.

Topic 19

America has a window of opportunity to extend and secure our present peace by promoting a distinctly American
internationalism. We will work with our allies and friends to be a force for good and a champion of freedom. We
will work for free markets free trade.

Our Nation also needs a clear strategy to confront the threats of this century threats that are more widespread and
less certain. They range from terrorists who threaten with bombs to tyrants in rogue nations intent upon developing
weapons of mass destruction. To protect our own people our allies and friends we must develop and we must
deploy effective missile defenses

Yet the cause of freedom rests on more than our ability to defend ourselves and our allies. Freedom is exported
every day as we ship goods and products that improve the lives of millions of people. Free trade brings greater
political and personal freedom. Each of the previous five Presidents has had the ability to negotiate far reaching
trade agreements.

Topic 2

Government cannot be replaced by charities or volunteers. Government should not fund religious activities. But our
Nation should support the good works of these good people who are helping their neighbors in need. So I propose
allowing all taxpayers whether they itemize or not to deduct their charitable contributions. Estimates show this
could encourage as much as one billion a year in new charitable giving money that will save and change lives.

I propose we make a major investment in conservation by fully funding the Land and Water Conservation Fund and
our national parks. As good stewards we must leave them better than we found them. So I propose to provide one
billion over ten years for the upkeep of these national treasures.

The budget adopts a hopeful new approach to help the poor and the disadvantaged. We must encourage and
support the work of charities and faith based and community groups that offer help and love one person at a
time. These groups are working in every neighborhood in America to fight homelessness and addiction and domestic
violence to provide a hot meal or a mentor or a safe haven for our children. Government should welcome these
groups to apply for funds not discriminate against them

Topic 39

Together we are changing the tone in the Nation’s Capital. And this spirit of respect and cooperation is vital
because in the end we will be judged not only by what we say or how we say it we will be judged by what we’re
able to accomplish.

The last time I visited the Capitol I came to take an oath on the steps of this building. I pledged to honor our
Constitution and laws and I asked you to join me in setting a tone of civility and respect in Washington. I hope
America is noticing the difference because we’re making progress

Neither picture is complete in and of itself. Tonight I challenge and invite Congress to work with me to use the
resources of one picture to repaint the other to direct the advantages of our time to solve the problems of our
people. Some of these resources will come from Government, some but not all.

TABLE  VI 

Dynamic  structure  analysis  for  years  2000-2004:  most  probable  topic  and  associated  paragraphs. 


Year  2000  (topic  37) 

Year  2001  (topic  12) 

Year  2002  (topic  19) 

Year  2003  (topic  37) 

Year  2004  (topic  34) 

We  begin  the  new  century  with 
over  one  million  new  jobs:  the 
fastest  economic  growth  in 
more  then  ten  years:  the  lowest 
unemployment  rates  in  yeers. 
the  lowest  poverty  rates  in 
yeers:  the  lowest  African 

American  end  Hispanic 
unemployment  retes  on  record 
America  will  achieve  the 
longest  penod  of  economic 
growth  in  our  entire  history.  We 
have  built  e  new  economy 

A  budget's  imped  is  counted  in 
dollars  but  measured  in  lives 
Excallent  schools  quality  health 
care  e  secure  retirement  e  deener 
environment  e  stronger  defense, 
these  ere  ell  important  needs  end 
we  fund  them  The  highest 
percentage  increese  in  our  budget 
should  go  to  our  children's 
education.  Education  is  my  top 
pnortty  end  by  supporting  this 
budget  you’ll  make  it  yours  es  wall 

Amence  has  e  window  of 
opportunity  to  extend  end  secure 
our  present  peace  by  promoting  e 
distinctly  Amencen 

Internationalism.  We  wdl  work  with 
our  ellies  and  friends  to  be  e  force 
fbr  good  end  e  champion  of 
freedom  We  will  work  for  free 
markets  free  trade. 

To  lift  the  stenderds  of  our  public 
schools  we  achieved  historic 
education  reform  which  must  now 
be  earned  out  in  every  school  end  In 
every  classroom  so  thet  every  child 
in  America  can  read  end  leem  end 
succeed  in  life  To  protect  our 
country  we  reorganized  our 
Government  end  created  the 
Department  of  Horn  ©lend  Security 
which  is  mobilizing  egeinst  the 
threats  of  e  new  ere  To  bnng  our 
economy  out  of  recession  we 
delivered  the  largest  tax  relief  in  e 
generation 

We  have  faced  serious  challenges 
together  end  now  we  face  e  choice 

We  can  go  forward  with  confidence 
end  resolve  or  we  can  turn  back  to 
the  dangerous  illusion  thet  terrorists 
ure  not  plotting  end  outlaw  regimes 
ere  no  threat  to  us.  We  can  press  on 
with  economic  growth  end  reforms 
in  education  end  Medicare  or  wa 
can  turn  back  to  old  policies  end  old 
divisions. 

Our  economic  revolution  hes 
been  metched  by  e  revival  of 
the  Amencan  spint  crime  down 
by  percent  to  its  lowest  level  in 
yeers  teen  births  down  yeers  in 
e  row  edoptions  up  by  percent 
weifere  rolls  cut  in  half  to  their 
lowest  levels  in  yeers 

When  it  comes  to  our  schools 
dollars  elone  do  not  always  make 
the  difference.  Funding  is 
important  end  so  is  reform.  So  we 
must  tie  funding  to  higher 
standards  end  accountability  for 
results 

Our Nation also needs a clear strategy to confront the threats of this century threats that are
more widespread and less certain. They range from terrorists who threaten with bombs to tyrants
in rogue nations intent upon developing weapons of mass destruction. To protect our own people
our allies and friends we must develop and we must deploy effective missile defenses.

Our first goal is clear we must have an economy that grows fast enough to employ every man and
woman who seeks a job. After recession terrorist attacks corporate scandals and stock market
declines our economy is recovering. Yet it's not growing fast enough or strongly enough. With
unemployment rising our Nation needs more small businesses to open more companies to invest and
expand more employers to put up the sign that says Help Wanted

Having broken the Baathist regime we face a remnant of violent Saddam supporters. Men who ran
away from our troops in battle are now dispersed and attack from the shadows. These killers
joined by foreign terrorists are a serious continuing danger. Yet we're making progress against
them. The once all powerful ruler of Iraq was found in a hole and now sits in a prison cell.

The top officials of the former regime we have captured or killed

Our forces are on the offensive leading over patrols a day and conducting an average of raids a
week

Eight years ago it was not so clear to most Americans there would be much to celebrate in
the year. Then our Nation was gripped by economic distress social decline political gridlock.
The title of a bestselling book asked America What Went Wrong.

Schools will be given a reasonable chance to improve and the support to do so. Yet if they don't
if they continue to fail we must give parents and students different options a better public
school a private school tutoring or a charter school. In the end every child in a bad situation
must be given a better choice because when it comes to our children failure is simply not an
option.

Yet the cause of freedom rests on more than our ability to defend ourselves and our allies.
Freedom is exported every day as we ship goods and products that improve the lives of millions
of people. Free trade brings greater political and personal freedom. Each of the previous five
Presidents has had the ability to negotiate far reaching trade agreements

A growing economy and a focus on essential priorities will be crucial to the future of Social
Security. As we continue to work together to keep Social Security sound and reliable we must
offer younger workers a chance to invest in retirement accounts that they will control and
they will own.

As democracy takes hold in Iraq the enemies of freedom will do all in their power to spread
violence and fear. They are trying to shake the will of our country and our friends but the
United States of America will never be intimidated by thugs and assassins. The killers will fail
and the Iraqi people will live in freedom
Fig. 14. Perplexity results on the United States presidential State of the Union Address for dDTM, LDA, TOT and DTM: mean value and
standard deviation, estimated from 20 randomly initialized VB realizations.


and  TOT  slightly  outperform  the  DTM  model  due  to  the  Dirichlet  distribution  approximations  made  in 
the  DTM  model. 
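As a reading aid for the perplexity comparison, the following is a minimal sketch of how a mean and standard deviation of perplexity over several runs could be computed. It assumes the standard held-out definition, perplexity = exp(-(1/N) Σ log p(w_n)); the report's exact definition is not restated in this section, and the function names and the toy log-probabilities are hypothetical, not the authors' code or data.

```python
import math
import statistics

def perplexity(log_probs):
    """Per-word perplexity from natural-log probabilities of held-out words.

    Assumes the standard definition exp(-mean log-likelihood); this is an
    assumption, not necessarily the exact formula used in the report.
    """
    if not log_probs:
        raise ValueError("need at least one held-out word")
    return math.exp(-sum(log_probs) / len(log_probs))

def summarize_runs(per_run_log_probs):
    """Mean and sample standard deviation of perplexity across runs."""
    values = [perplexity(lp) for lp in per_run_log_probs]
    return statistics.mean(values), statistics.stdev(values)

# Toy, made-up inputs: two runs, each scoring four held-out words.
runs = [
    [math.log(0.25)] * 4,   # every word has probability 1/4
    [math.log(0.125)] * 4,  # every word has probability 1/8
]
mean_ppl, std_ppl = summarize_runs(runs)
```

In the setting of Fig. 14 there would be one list of held-out log-probabilities per randomly initialized VB realization (20 in total) for each model; lower perplexity indicates a better predictive fit.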

Concerning computational costs, all code was run in Matlab™ on a PC with an Intel 2.33 GHz processor.
For the NIPS data, dDTM, LDA, TOT and DTM required (for each of the VB and MCMC runs) 4 hours and
16 minutes, 3 hours and 22 minutes, 10 hours and 31 minutes, and 3 hours and 45 minutes, respectively.
For the State of the Union data these respective times were 25, 22, 104 and 23 minutes. These times are
meant to give relative computational costs; none of the software was optimized.
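Since the timings are intended only as relative costs, a small sketch such as the following makes the ratios explicit. It reads the NIPS timings as 4 h 16 min, 3 h 22 min, 10 h 31 min and 3 h 45 min; the choice of LDA as the baseline and the helper names are assumptions made for illustration.

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes timing to total minutes."""
    return 60 * hours + minutes

# Timings as reported in the text (per run), in minutes.
nips_minutes = {
    "dDTM": to_minutes(4, 16),
    "LDA": to_minutes(3, 22),
    "TOT": to_minutes(10, 31),
    "DTM": to_minutes(3, 45),
}
sotu_minutes = {"dDTM": 25, "LDA": 22, "TOT": 104, "DTM": 23}

def relative_to(times, baseline):
    """Express each timing as a multiple of the baseline model's timing."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}

# LDA as baseline is an arbitrary, illustrative choice.
nips_rel = relative_to(nips_minutes, "LDA")
sotu_rel = relative_to(sotu_minutes, "LDA")

for name in nips_minutes:
    print(f"{name}: NIPS x{nips_rel[name]:.2f}, State of the Union x{sotu_rel[name]:.2f}")
```

Because none of the software was optimized, these ratios indicate rough ordering only.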

XIX.  Topic  Modeling  Summary 

We  have  developed  a  novel  topic  model,  the  truncated  dynamic  HDP,  or  dDTM,  to  analyze  topics 
associated  with  documents  with  known  time  stamps.  The  new  model  allows  simple  variational  Bayesian 
(VB)  inference,  yielding  fast  computation  times.  The  algorithm  has  been  demonstrated  on  a  large  database, 
the US State of the Union addresses over a 220-year period, and the results appear to highlight significant
events in US history (although it should be emphasized that the authors are not historians, and much
further  testing  and  evaluation  is  required).  The  algorithm  is  able  to  identify  important  historical  topics, 
as  well  as  periods  of  time  over  which  significant  changes  in  topics  are  realized.  The  model  compares 
favorably  with  LDA,  TOT  and  a  simplified  form  of  dDTM  (for  which  time  dependence  is  ignored). 

Concerning  future  research,  other  approaches  that  might  be  considered  for  approximate  inferences 
include  collapsed  sampling  [40].  It  would  be  interesting  to  analyze  how  these  different  inferences  influence 
the overall performance of the model. In order to capture semantic evolution with time, one may consider
a similar dynamic model for the topics themselves. This could be accomplished by allowing the word
distributions to change in time; for identifiability, constraints could be used so that the majority of words in a
topic, and their associated frequencies, remain constant across time. In addition, the evolution of the model
considered here occurs in only one dimension (time). There may be problems for which documents are collected
at  different  geographical  locations,  for  example  from  different  cities  across  the  world.  In  this  case  one 
may  have  spatial  proximity  as  well  as  temporal  proximity  to  consider,  when  considering  inter-document 
relationships.  It  is  of  interest  to  extend  the  dynamic  structure  from  one  dimension  to  perhaps  a  graphical 
structure,  where  the  nodes  of  the  graph  may  represent  space  and  time. 

We also note there may be general interest within topic-model research in representing a draw from
a Dirichlet process in the form in (27). While this increases the complexity of the analysis, it has the
significant advantage of allowing one to place a Gamma prior on α and perform full VB inference (we
no longer have to set α). As discussed in Section XV, α plays an important role in defining the number
of expected topics per document (since it controls the number of important mixture weights). One may
place a separate prior on the distinct α associated with each document, so that the number of important
topics per document may change. The complication with doing this, rather than just directly drawing from
Dir(α/K, ..., α/K), is that one must now perform inference on many more parameters (on the sticks
of the stick-breaking representation). In some applications such added complexity will be warranted by a
desire to infer α in a full VB analysis.
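To illustrate how the concentration parameter controls the number of important mixture weights, the sketch below draws weights from a truncated stick-breaking construction and counts how many are needed to cover 90% of the mass. It assumes the standard construction (V_k ~ Beta(1, α), π_k = V_k ∏_{j<k}(1 − V_j)), which may differ in detail from the form in (27); the 90% threshold, the truncation level and all function names are hypothetical choices. It samples from the prior only and is not a VB inference routine.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking draw (standard construction, assumed here)."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # last weight takes the leftover mass
    return weights

def num_important(weights, mass=0.9):
    """Smallest number of largest weights whose sum reaches `mass`."""
    total = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        total += w
        if total >= mass:
            return count
    return len(weights)

def mean_important(alpha, truncation=50, draws=200, seed=0):
    """Average number of important weights over repeated prior draws."""
    rng = random.Random(seed)
    counts = [
        num_important(stick_breaking_weights(alpha, truncation, rng))
        for _ in range(draws)
    ]
    return sum(counts) / draws

def sample_alpha(shape, rate, rng):
    """Draw alpha ~ Gamma(shape, rate); gammavariate expects a scale (1/rate)."""
    return rng.gammavariate(shape, 1.0 / rate)

if __name__ == "__main__":
    for a in (0.5, 2.0, 10.0):
        print(f"alpha={a}: about {mean_important(a):.1f} important weights")
```

Small α concentrates mass on a few sticks, while large α spreads it over many. Drawing a separate α per document from a Gamma prior, as with `sample_alpha`, would therefore let the effective number of topics vary across documents, at the cost of one set of stick variables per document.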